This is part 3 of a 3-part series.
- Part 1 - What’s the problem and why should I care?
- Part 2 - Using DynamoDB to manage once-only functionality
- Part 3 - AWS and Stripe APIs that support idempotency
Some APIs and services provide features that help us deal with idempotency. However, each service still needs careful consideration, because the features are implemented quite differently. Where we can, it’s a good idea to use an API’s own idempotency features rather than managing the various “once-and-only-once” issues ourselves.
Many AWS services accept a ClientRequestToken parameter, including Athena, Secrets Manager, CodeCommit and more (EC2 uses a similar parameter called ClientToken). When a call is repeated with the same ClientRequestToken within the service’s time window, the service recognises it as a duplicate and does not apply it a second time.
The behaviour of this parameter varies by service. For example, at the time of writing, DynamoDB has a relatively short maximum length of 36 characters for its ClientRequestToken, compared to the 64-character limit of Amazon Chime and AWS Secrets Manager.
64 characters is just enough to store a hex-encoded SHA256 hash, whereas DynamoDB’s 36-character maximum is only big enough for a hex-encoded MD5 hash (32 characters).
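To make the length limits concrete, here is a small Python sketch that derives a token by hashing the request payload and picks whichever digest fits the service’s limit. `MAX_TOKEN_LENGTH` and `token_for` are illustrative names, not part of any AWS SDK, and the limits are simply the ones quoted above.

```python
import hashlib

# ClientRequestToken length limits quoted above (at the time of writing; check current docs).
MAX_TOKEN_LENGTH = {"dynamodb": 36, "secretsmanager": 64, "chime": 64}

def token_for(service: str, payload: bytes) -> str:
    """Return a hex digest of `payload` that fits the service's token limit.

    Prefers SHA-256 (64 hex characters) and falls back to MD5 (32 hex characters).
    """
    limit = MAX_TOKEN_LENGTH[service]
    for algorithm in ("sha256", "md5"):
        digest = hashlib.new(algorithm, payload).hexdigest()
        if len(digest) <= limit:
            return digest
    raise ValueError(f"no supported digest fits in {limit} characters")

message = b'{"order_id": 123, "event": "payment_taken"}'
print(len(token_for("secretsmanager", message)))  # 64
print(len(token_for("dynamodb", message)))        # 32
```

Because the token is derived from the payload, a retry of the same request produces the same token without having to store it anywhere.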
Not every service specifies how long its ClientRequestToken window lasts, so it may be permanent in some services but only temporary in others; the idempotency window can be as short as 5 minutes. This can make the built-in functionality unsuitable for use cases where a duplicate message or retry might arrive outside that window.
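The following is a minimal in-memory model of what a windowed token means in practice, assuming a service that replays the stored result for a repeated token, rejects a reused token with different parameters, and forgets the token once the window has passed. `IdempotentEndpoint` and `TokenMismatchError` are hypothetical names; real services each have their own errors and timing rules (DynamoDB, for instance, raises `IdempotentParameterMismatchException` on a parameter mismatch).

```python
import time
from typing import Any, Callable, Dict, Tuple

class TokenMismatchError(Exception):
    """Raised when a token is reused with different parameters (hypothetical error)."""

class IdempotentEndpoint:
    """In-memory model of a service that honours a client request token for `window_seconds`.

    An illustration of the behaviour described above, not how any AWS service is built.
    """

    def __init__(self, operation: Callable[[Dict[str, Any]], Any], window_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._operation = operation
        self._window = window_seconds
        self._clock = clock
        self._seen: Dict[str, Tuple[Dict[str, Any], Any, float]] = {}  # token -> (params, result, first seen)

    def call(self, token: str, params: Dict[str, Any]) -> Any:
        now = self._clock()
        entry = self._seen.get(token)
        if entry is not None and now - entry[2] < self._window:
            stored_params, stored_result, _ = entry
            if stored_params != params:
                raise TokenMismatchError(token)
            return stored_result  # duplicate inside the window: no second side effect
        # Unknown token, or the window has expired: the operation runs (again).
        result = self._operation(params)
        self._seen[token] = (dict(params), result, now)
        return result

# Demo with a fake clock and a 5-minute (300 second) window.
orders_created = []

def create_order(params: Dict[str, Any]) -> str:
    orders_created.append(params)
    return f"order-{len(orders_created)}"

fake_now = [0.0]
endpoint = IdempotentEndpoint(create_order, window_seconds=300, clock=lambda: fake_now[0])
first = endpoint.call("tok-1", {"sku": "A"})   # runs the operation: "order-1"
fake_now[0] = 120.0
repeat = endpoint.call("tok-1", {"sku": "A"})  # retry inside the window: "order-1" again
fake_now[0] = 400.0
late = endpoint.call("tok-1", {"sku": "A"})    # retry after the window: runs again, "order-2"
```

The late retry is the case to worry about: the token looks the same to the caller, but the service has forgotten it, so the side effect happens twice.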
Amazon DynamoDB
To use the ClientRequestToken with DynamoDB, you need to use the TransactWriteItems API call rather than the simpler PutItem API call.
The token must be a string of between 1 and 36 characters, which is just enough for a UUID. Another option is the hex-encoded MD5 hash of the inbound message (32 characters), which has the advantage that a redelivered copy of the same message produces the same token.
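As a sketch of that second option, this function builds the keyword arguments for boto3’s DynamoDB `transact_write_items` call, using the MD5 of a canonical JSON encoding of the message as the `ClientRequestToken`. It assumes the message carries an `order_id`; the table name `orders` and the attributes `pk` and `payload` are hypothetical. The actual AWS call is left commented out so the example runs without credentials.

```python
import hashlib
import json
from typing import Any, Dict

def transact_put_params(table_name: str, message: Dict[str, Any]) -> Dict[str, Any]:
    """Build keyword arguments for boto3's DynamoDB client.transact_write_items.

    The ClientRequestToken is the hex-encoded MD5 of the message (32 characters),
    so a redelivered copy of the same message produces the same token.
    """
    # Canonical JSON (sorted keys, no extra whitespace) so equal messages hash identically.
    canonical = json.dumps(message, sort_keys=True, separators=(",", ":"))
    token = hashlib.md5(canonical.encode("utf-8")).hexdigest()
    return {
        "TransactItems": [
            {
                "Put": {
                    "TableName": table_name,
                    # Low-level attribute-value format; "pk" and "payload" are hypothetical attributes.
                    "Item": {
                        "pk": {"S": f"order#{message['order_id']}"},
                        "payload": {"S": canonical},
                    },
                }
            }
        ],
        "ClientRequestToken": token,
    }

msg_a = {"order_id": 123, "event": "payment_taken"}
msg_b = {"event": "payment_taken", "order_id": 123}  # same content, different key order
params_a = transact_put_params("orders", msg_a)
params_b = transact_put_params("orders", msg_b)
# boto3.client("dynamodb").transact_write_items(**params_a)
```

Canonicalising the JSON matters: without `sort_keys=True`, two logically identical messages could hash differently and defeat the deduplication.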
DynamoDB idempotency tokens are only valid for 10 minutes after the first request that uses them completes, so they only protect against retries within a relatively short period of time.
AWS Step Functions
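Step Functions executions are idempotent based on the execution name rather than a ClientRequestToken parameter, and the window period is 90 days. For Standard workflows, starting an execution with the same name and input as one that is still running returns the original response; once that execution has finished, or if the input differs, the call fails with an ExecutionAlreadyExists error instead of starting a second execution.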
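Since the execution name is the idempotency key, it should be derived from the business event rather than generated randomly. Here is a sketch that builds the keyword arguments for boto3’s Step Functions `start_execution` call; the naming scheme, the ARN and the restriction to letters, digits, `-` and `_` are assumptions (a conservative subset of what execution names allow, truncated to the 80-character limit). The input is serialised with sorted keys so that a repeated request sends byte-identical input.

```python
import json
import re
from typing import Any, Dict

# Conservative character set for execution names (an assumption; names are limited to 80 characters).
_UNSAFE = re.compile(r"[^A-Za-z0-9_-]")

def start_execution_params(state_machine_arn: str, order_id: str, event: str,
                           payload: Dict[str, Any]) -> Dict[str, Any]:
    """Build keyword arguments for boto3's Step Functions client.start_execution.

    The name comes from the business event, so a duplicate request for the same
    order event reuses the name instead of starting a second execution.
    """
    name = _UNSAFE.sub("-", f"order-{order_id}-{event}")[:80]
    return {
        "stateMachineArn": state_machine_arn,
        "name": name,
        # Deterministic input: a retry must send the same input to be treated as the same request.
        "input": json.dumps(payload, sort_keys=True),
    }

arn = "arn:aws:states:eu-west-1:123456789012:stateMachine:orders"  # hypothetical
params = start_execution_params(arn, "A/1", "paid", {"amount": 2000, "currency": "gbp"})
# boto3.client("stepfunctions").start_execution(**params)
```

A caller might treat an ExecutionAlreadyExists error for this name as “already started” rather than as a failure, though whether that is safe depends on how the workflow handles its own failures.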
Step Functions are really helpful for handling cases where several steps each have to happen exactly once. There’s a good blog post on the saga pattern at https://theburningmonk.com/2017/07/applying-the-saga-pattern-with-aws-lambda-and-step-functions/
The ability to catch errors and to do long-running compensation tasks and retries makes them a good solution for executing multiple Lambda functions, but it’s important that you’ve designed your Step Function to be able to restart at failed states. There’s a good blog post on that at https://aws.amazon.com/blogs/compute/resume-aws-step-functions-from-any-state/
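To show the shape of the saga idea without any AWS dependencies, here is a plain-Python sketch: each step pairs an action with a compensation, and when a step fails, the compensations for the steps that already completed run in reverse order. `run_saga` and the travel-booking steps are hypothetical; in a Step Functions workflow, each action and compensation would be its own state, with Catch and Retry handling the transitions.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], None], Callable[[], None]]  # (name, action, compensation)

def run_saga(steps: List[Step]) -> List[str]:
    """Run actions in order; on failure, compensate completed steps in reverse.

    Returns a log of what happened. A compensation that raises propagates its error,
    which is why compensations themselves need to be retryable and idempotent.
    """
    log: List[str] = []
    completed: List[Step] = []
    for step in steps:
        name, action, _ = step
        try:
            action()
        except Exception:
            log.append(f"failed:{name}")
            for done_name, _, compensate in reversed(completed):
                compensate()
                log.append(f"compensated:{done_name}")
            return log
        completed.append(step)
        log.append(f"done:{name}")
    return log

def car_hire_unavailable() -> None:
    raise RuntimeError("car hire unavailable")

trip = [
    ("book_flight", lambda: None, lambda: None),
    ("book_hotel", lambda: None, lambda: None),
    ("book_car", car_hire_unavailable, lambda: None),
]
result = run_saga(trip)
# ['done:book_flight', 'done:book_hotel', 'failed:book_car',
#  'compensated:book_hotel', 'compensated:book_flight']
```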
Amazon SQS FIFO queues
SQS first-in-first-out (FIFO) queues filter out duplicate messages sent within a 5-minute window, based on a deduplication ID supplied with the message (or, if content-based deduplication is enabled on the queue, a SHA-256 hash of the message body). https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/using-messagededuplicationid-property.html
A useful deduplication ID might be something like orderid_123_event_1: something based on the message content and meaning. If the system sending messages to SQS just uses a random UUID each time, deduplication won’t work at all, because every message gets a different ID even when its content is identical to a previous message.
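Here is a sketch of building that kind of ID, as keyword arguments for boto3’s SQS `send_message` call on a FIFO queue. The queue URL and the choice of the order ID as `MessageGroupId` are assumptions for illustration; a group ID is required on FIFO queues, and messages sharing one are delivered in order.

```python
from typing import Dict

def fifo_send_params(queue_url: str, order_id: int, event_seq: int, body: str) -> Dict[str, str]:
    """Build keyword arguments for boto3's SQS client.send_message on a FIFO queue."""
    return {
        "QueueUrl": queue_url,
        "MessageBody": body,
        # Required for FIFO queues: messages sharing a group ID are delivered in order.
        "MessageGroupId": f"order_{order_id}",
        # Based on what the message means, so resending the same event within the
        # 5-minute deduplication interval is dropped by SQS.
        "MessageDeduplicationId": f"orderid_{order_id}_event_{event_seq}",
    }

queue_url = "https://sqs.eu-west-1.amazonaws.com/123456789012/orders.fifo"  # hypothetical queue
first_send = fifo_send_params(queue_url, 123, 1, '{"status": "paid"}')
resend = fifo_send_params(queue_url, 123, 1, '{"status": "paid"}')
# boto3.client("sqs").send_message(**first_send)
```

Both sends carry the same deduplication ID, so if the sender retries after a timeout, SQS keeps only one copy, which would not happen with a fresh UUID per send.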
However, even though putting messages into an SQS FIFO queue deduplicates them, it doesn’t mean that each message will only be delivered once. For example, if a Lambda consumer crashes halfway through processing a message, the message becomes visible again once its visibility timeout expires and is delivered again.
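So the consumer still has to be idempotent. A minimal sketch follows, assuming each message carries its deduplication ID in a `dedup_id` field (a hypothetical message shape) and using an in-memory dict as a stand-in for a durable store such as the DynamoDB table from Part 2.

```python
from typing import Dict, List

# Hypothetical in-memory stand-in for a durable store (such as a DynamoDB table).
processed: Dict[str, str] = {}
shipments: List[str] = []  # records every time the real work happens

def handle(message: Dict[str, str]) -> None:
    """Process a message so that a redelivered copy has no extra effect."""
    key = message["dedup_id"]
    if key in processed:
        return  # already handled on an earlier delivery
    shipments.append(key)  # the real side effect, e.g. shipping an order
    # A crash between the side effect and this line would still repeat the work;
    # see Part 2 for a DynamoDB-based approach to once-only processing.
    processed[key] = "done"

message = {"dedup_id": "orderid_123_event_1"}
handle(message)  # first delivery; suppose the consumer then crashes before deleting the message
handle(message)  # SQS redelivers it after the visibility timeout
```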
Stripe
Stripe is a good example of a well-written API that supports idempotency. In the Stripe API, an idempotency key is passed in the Idempotency-Key HTTP header, documented at https://stripe.com/docs/api/idempotent_requests
The idempotency window of this service is 24 hours, and the documentation recommends using UUIDs, which are 36 characters long.
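To show where the key goes, this sketch uses only the Python standard library to build (but not send) a request to Stripe’s payment intents endpoint with an `Idempotency-Key` header. The API key is a placeholder and the amount and currency are made up; in practice you would more likely use Stripe’s own client library.

```python
import urllib.parse
import urllib.request
import uuid

def build_payment_intent_request(api_key: str, amount: int, currency: str,
                                 idempotency_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST to Stripe's payment intents endpoint."""
    body = urllib.parse.urlencode({"amount": amount, "currency": currency}).encode("ascii")
    return urllib.request.Request(
        "https://api.stripe.com/v1/payment_intents",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Idempotency-Key": idempotency_key,
            "Content-Type": "application/x-www-form-urlencoded",
        },
    )

# Generate the key once per logical operation and reuse it on every retry of that operation.
order_key = str(uuid.uuid4())  # a 36-character UUID, as the documentation recommends
request = build_payment_intent_request("sk_test_placeholder", 2000, "gbp", order_key)
# urllib.request.urlopen(request) would send it; left out so the example runs offline.
```

Note that `urllib` normalises header names, so the header is stored as `Idempotency-key`; HTTP header names are case-insensitive, so this doesn’t change its meaning.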
Implementing idempotency is a nice feature for customers of an API, since it lets automatic retries be handled safely. For example, a common pattern with Stripe is to create a payment intent, then store that record in a local database. If creating the intent succeeds but storing it fails, we can retry the whole action without creating two payment intents for the same order, because every call with the same idempotency key gets the same response for 24 hours after the first request.
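The retry flow in that example can be simulated with two in-memory fakes: one that replays the first response for a repeated idempotency key (as the Stripe documentation describes), and a database whose first write fails. `FakeStripe`, `FlakyDatabase` and `create_and_store` are hypothetical names for illustration, not Stripe’s client library.

```python
import uuid
from typing import Dict, List

class FakeStripe:
    """Hypothetical stand-in that replays the first response for a repeated idempotency key."""
    def __init__(self) -> None:
        self.created: List[str] = []
        self._responses: Dict[str, str] = {}

    def create_payment_intent(self, order_id: str, idempotency_key: str) -> str:
        if idempotency_key in self._responses:
            return self._responses[idempotency_key]  # same key: same response, nothing new created
        intent_id = f"pi_{len(self.created) + 1}"
        self.created.append(intent_id)
        self._responses[idempotency_key] = intent_id
        return intent_id

class FlakyDatabase:
    """Hypothetical local database whose first `failures` writes fail."""
    def __init__(self, failures: int) -> None:
        self.failures_left = failures
        self.rows: Dict[str, str] = {}

    def save(self, order_id: str, intent_id: str) -> None:
        if self.failures_left > 0:
            self.failures_left -= 1
            raise ConnectionError("database write failed")
        self.rows[order_id] = intent_id

def create_and_store(stripe_api: FakeStripe, db: FlakyDatabase, order_id: str, attempts: int = 3) -> str:
    # One key per order, reused on every retry below. A real system would persist the key
    # (or derive it from the order) so that a restarted process also reuses it.
    key = str(uuid.uuid4())
    for attempt in range(attempts):
        intent_id = stripe_api.create_payment_intent(order_id, idempotency_key=key)
        try:
            db.save(order_id, intent_id)
            return intent_id
        except ConnectionError:
            if attempt == attempts - 1:
                raise
    raise ValueError("attempts must be at least 1")

fake_stripe = FakeStripe()
db = FlakyDatabase(failures=1)
intent = create_and_store(fake_stripe, db, "order-123")
# One payment intent ("pi_1") is created even though the whole action ran twice.
```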
Idempotency is a useful feature of APIs: it can reduce operational overhead and enable safe retries, provided that every API within an operation supports it and any retries happen within the appropriate window of time.
However, even within AWS’s own suite of services, the implementation details vary and need to be thought through.
When dealing with operations that are not idempotent and must run once and only once, we may need to introduce a lock token to stop multiple processes from duplicating the work. This brings extra monitoring and support requirements: we need to make sure that no task is left started but not completed, and that no task is left in an inconsistent state.
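A minimal sketch of such a lock, assuming a lease model: each holder gets a unique owner token and an expiry time, only the owner can release the lock, and expired-but-unreleased locks are reported as tasks that were started but never completed. `LeaseLock` is a hypothetical in-memory class; in practice the same idea would typically sit on a shared store such as DynamoDB.

```python
import time
import uuid
from typing import Callable, Dict, List, Optional, Tuple

class LeaseLock:
    """In-memory lock with an owner token and expiry (a hypothetical stand-in for a shared store)."""

    def __init__(self, lease_seconds: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._lease = lease_seconds
        self._clock = clock
        self._locks: Dict[str, Tuple[str, float]] = {}  # task_id -> (owner token, expiry time)

    def acquire(self, task_id: str) -> Optional[str]:
        """Return a new owner token, or None if another worker holds an unexpired lease."""
        now = self._clock()
        held = self._locks.get(task_id)
        if held is not None and held[1] > now:
            return None
        token = str(uuid.uuid4())
        self._locks[task_id] = (token, now + self._lease)
        return token

    def release(self, task_id: str, token: str) -> bool:
        """Release only if the caller still owns the lock (its lease may have been taken over)."""
        held = self._locks.get(task_id)
        if held is None or held[0] != token:
            return False
        del self._locks[task_id]
        return True

    def stuck(self) -> List[str]:
        """Tasks whose lease expired without release: started but not completed, so they need attention."""
        now = self._clock()
        return [task_id for task_id, (_, expiry) in self._locks.items() if expiry <= now]

fake_now = [0.0]
lock = LeaseLock(lease_seconds=60, clock=lambda: fake_now[0])
worker_a = lock.acquire("task-1")   # worker A starts the task
worker_b = lock.acquire("task-1")   # None: worker B is kept out while A's lease is valid
fake_now[0] = 90.0                  # worker A has crashed and its lease has expired
stuck_tasks = lock.stuck()          # ["task-1"]: something for monitoring to alert on
worker_c = lock.acquire("task-1")   # a new worker takes over the task
```

Checking the owner token on release stops a slow worker whose lease has expired from releasing a lock that another worker now holds.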
If we’re looking to build systems that are reliable at high volume, it’s a good idea to break things down into steps that can be executed individually rather than having multiple side-effect calls (API calls, database writes) in a single Lambda function. This can be done by using a Step Function, or dividing up work by using DynamoDB streams (writing to the database, then using that side effect to trigger further Lambda operations).
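As a sketch of the DynamoDB Streams approach, this function processes a stream event in the record format Lambda receives and triggers one follow-on step per newly inserted item, so each Lambda performs a single kind of side effect. The `next_step` callback and the sample event values are hypothetical, and it assumes the stream view type includes new images. Stream records can also be delivered more than once, so the follow-on step should itself be idempotent.

```python
from typing import Any, Callable, Dict

def handle_stream_event(event: Dict[str, Any], next_step: Callable[[Dict[str, Any]], None]) -> int:
    """Trigger one follow-on step per newly inserted item in a DynamoDB Streams event.

    Returns the number of steps triggered.
    """
    triggered = 0
    for record in event.get("Records", []):
        if record.get("eventName") != "INSERT":
            continue  # ignore MODIFY and REMOVE records
        # NewImage is present when the stream view type includes new images.
        next_step(record["dynamodb"]["NewImage"])
        triggered += 1
    return triggered

# Sample event in the DynamoDB Streams record shape (values are made up).
sample_event = {
    "Records": [
        {"eventName": "INSERT", "dynamodb": {"NewImage": {"pk": {"S": "order#123"}}}},
        {"eventName": "MODIFY", "dynamodb": {"NewImage": {"pk": {"S": "order#122"}}}},
    ]
}
started = []
count = handle_stream_event(sample_event, started.append)

# In Lambda, the entry point would pass the real next step (hypothetical name):
# def lambda_handler(event, context):
#     return handle_stream_event(event, start_fulfilment)
```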
It’s also a good idea to make sure that work can be retried easily, to reduce the labour involved in handling errors: a small percentage of errors across a huge number of jobs can still be a large number of errors.
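One way to make retries cheap is a small helper that retries with exponential backoff and full jitter, shown here as a sketch. `retry_with_backoff` is a hypothetical helper, the delay values are arbitrary, and catching every `Exception` is a simplification: in practice you would retry only errors known to be transient. It is only safe when the operation is idempotent, for example because it reuses the same idempotency key on each attempt.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def retry_with_backoff(operation: Callable[[], T], attempts: int = 5, base_delay: float = 0.2,
                       max_delay: float = 5.0, sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `operation` until it succeeds, backing off exponentially with full jitter."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error for a person or a dead-letter queue
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))  # random wait up to the capped exponential delay
    raise ValueError("attempts must be at least 1")

calls = {"n": 0}

def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

delays = []
result = retry_with_backoff(flaky, sleep=delays.append)  # succeeds on the third attempt
```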