adrianhesketh.com

Idempotency in Lambda - 1 - What is it and why should I care?

This is part 1 of a 3 part series.

What is it, and why should I care?

Idempotence is the property of certain operations in mathematics and computer science whereby they can be applied multiple times without changing the result beyond the initial application. https://en.wikipedia.org/wiki/Idempotence
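
To make the definition concrete, here is a minimal sketch contrasting an idempotent operation with a non-idempotent one. The function names and the order dictionary are made up for illustration.

```python
def set_status_completed(order: dict) -> dict:
    """Idempotent: applying it repeatedly gives the same result as applying it once."""
    return {**order, "status": "completed"}


def increment_retry_count(order: dict) -> dict:
    """Not idempotent: every application changes the result again."""
    return {**order, "retries": order.get("retries", 0) + 1}


order = {"id": "order-1", "status": "created", "retries": 0}

once = set_status_completed(order)
twice = set_status_completed(once)
print(once == twice)  # True

once = increment_retry_count(order)
twice = increment_retry_count(once)
print(once == twice)  # False
```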

Real-world processes can often be modelled as “finite state machines”. For example, if a retail order is “created” by some process, some work needs to be done to get that order through to a “complete” stage. Not all transitions of state are valid - we can’t take a created order and get to the complete stage without going through some extra states, like payment_started, payment_completed, picking_started, picking_completed, dispatching_started, dispatching_complete, delivery_started, delivery_complete, complete.
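
As a rough sketch, the order lifecycle can be written down as a table of allowed transitions. This assumes a strictly linear flow through the states listed above; a real system would probably also need states such as cancellations or failed payments, which are not modelled here.

```python
# States in the order listed in the text; each state may only move to the next one.
STATES = [
    "created",
    "payment_started",
    "payment_completed",
    "picking_started",
    "picking_completed",
    "dispatching_started",
    "dispatching_complete",
    "delivery_started",
    "delivery_complete",
    "complete",
]

# Map each state to the single state that is allowed to follow it.
VALID_TRANSITIONS = dict(zip(STATES, STATES[1:]))


def can_transition(current: str, target: str) -> bool:
    """Return True only if target is the direct successor of current."""
    return VALID_TRANSITIONS.get(current) == target


print(can_transition("created", "payment_started"))  # True
print(can_transition("created", "complete"))  # False
```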

The movement from _started to _completed usually requires something to happen - making an API call to a 3rd party, moving some physical items etc. - and some data is generated as a result, maybe a tracking code from the 3rd party API, the date and time when things started/completed, and what did the work.

We can use event-driven systems to manage this, starting with an API call.

Step 1 - handle an API request and store the state change.

  • Receive an “order created” API call or order_created event.
  • Write a database transaction that has no effect or throws an error if the order already exists.
    • Writing the database transaction causes an order_created event to be published, e.g. via DynamoDB Streams to EventBridge.
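
A minimal sketch of the "no effect if the order already exists" write, using an in-memory SQLite table as a stand-in for the real database so that it runs anywhere. The table layout and the create_order name are assumptions. In DynamoDB, the same idea would be a conditional put with an attribute_not_exists condition; publishing the order_created event is not modelled here.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, state TEXT NOT NULL)")


def create_order(order_id: str) -> bool:
    """Insert the order only if it doesn't exist. Returns True if this call created it."""
    with db:  # commits on success, rolls back on error
        cursor = db.execute(
            "INSERT OR IGNORE INTO orders (id, state) VALUES (?, 'created')",
            (order_id,),
        )
    # rowcount is 0 when the primary key already existed and the insert was ignored.
    return cursor.rowcount == 1


print(create_order("order-1"))  # True
print(create_order("order-1"))  # False
```

Because the second call changes nothing, a retried "order created" request is safe to process again.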

Step 2 - start payment with the Stripe API.

It’s tempting to make an API call out to Stripe in the same function that creates the order, but executing multiple actions in a single execution unit makes it possible that some of the actions succeed, but others fail or don’t execute, without rolling back. This is a partially committed transaction - we could find ourselves with an order that started, but hasn’t got a payment in place, and no way of retrying to get that to happen.

So, let’s keep it to one side-effect or state change per unit of execution by creating a new Lambda function to handle the order_created event. The Lambda function does one thing - calls the Stripe API.

This has several benefits: it moves the execution out of the synchronous API call, which reduces the latency of the “create order” API call; it stops the “create order” call from crashing if Stripe is down or there’s a network problem; and it enables automatic retries of the Stripe API call if there’s a problem.

  • Receive the order_created event.
    • Use the Stripe API to create a payment intent.

Step 3 - receive asynchronous updates via Webhooks.

The Stripe API uses Webhooks to report back state changes, so we’ll receive them and use them to change our state.

  • Receive the Stripe payment_intent.created Webhook.
    • Write a database transaction that updates the state from order_created to payment_started - fail if the order is not in order_created state, and don’t update the state if the order is already in a payment_completed or later state.
      • Writing the database transaction successfully causes a payment_started event to be published, e.g. via DynamoDB Streams to EventBridge.

Once the user completes the process, we’ll receive the payment_succeeded webhook, but there’s also a chance that we get a payment_succeeded webhook before we get the created webhook. To make sure we’re getting things in the right order, we can reject the webhook until we’ve received the payment_intent.created webhook first by throwing an error to force Stripe to retry later. Or, we could just accept that we’ve skipped a notification.

  • Receive a payment_succeeded webhook from Stripe.
    • Write a database transaction that updates the state from payment_started to payment_completed and stores information about the event.
      • Make sure that the transaction fails if the order is not in the payment_started state, or that the transaction has no effect if a duplicate event has been received.
      • Writing the database transaction causes a payment_completed event to be published, e.g. via DynamoDB Streams to EventBridge.
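
The webhook handling above boils down to a conditional state update. Here is a sketch under stated assumptions: SQLite stands in for the real database, the transition function and InvalidTransition exception are hypothetical names, and a "duplicate" is simplified to mean the order is already in the target state (the text also allows for later states).

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, state TEXT NOT NULL)")
db.execute("INSERT INTO orders VALUES ('order-1', 'payment_started')")
db.commit()


class InvalidTransition(Exception):
    """Raised so the caller can fail the webhook and let the sender retry later."""


def transition(order_id: str, expected: str, target: str) -> bool:
    """Move the order from expected to target in one conditional update.

    Returns True if the state changed and False for a duplicate event.
    """
    with db:
        cursor = db.execute(
            "UPDATE orders SET state = ? WHERE id = ? AND state = ?",
            (target, order_id, expected),
        )
        if cursor.rowcount == 1:
            return True
        row = db.execute(
            "SELECT state FROM orders WHERE id = ?", (order_id,)
        ).fetchone()
    if row is not None and row[0] == target:
        return False  # duplicate event: already applied, so no effect
    raise InvalidTransition(f"order {order_id} is not in state {expected}")


print(transition("order-1", "payment_started", "payment_completed"))  # True
print(transition("order-1", "payment_started", "payment_completed"))  # False
```

An order that is still in an earlier state raises InvalidTransition, which corresponds to rejecting an out-of-order webhook so that it gets retried.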

This kind of asynchronous processing is ideal because it enables automated retries and keeps processing simple - your code is receiving an event and making an API call, or is receiving an event and updating the state (which causes another event to be sent).

Asynchronous vs synchronous APIs

However, not all APIs are asynchronous. In some cases, we will be forced to call synchronous APIs - APIs which rely on the client storing a value provided by the API. In these cases, we need to do two things - call an API and save the data. This leads to a potential error state where we call the API successfully, but are unable to save the data.

In some APIs, this is fine. Let’s imagine we upload a file to an S3 bucket with a random filename. If we do that 10 times, we’ll spend a bit more on S3 storage, but it’s not really a problem. However, if we’ve just emailed a customer or spent a lot of money because of that API call, it’s not so fine. Lambda retries failed events, so a failed database save causes the whole function to run again, and the API gets called a second time.

To complicate matters, some synchronous APIs (e.g. AWS and Stripe APIs) have idempotency features built in that enable retries to be safe in limited circumstances - if you pass the same idempotency tokens into them, you always get the same output (terms and conditions apply). If the idempotency window aligns with your need, this sometimes enables a shortcut to be taken by making it safe to make an API call followed by a database save operation, but it’s not a pattern that can be applied safely everywhere - care must be taken to do it right. These APIs are described in part 3.
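
To illustrate how an idempotency token behaves, here is a sketch of a made-up API that remembers the response for each key. FakePaymentAPI is hypothetical and is not the Stripe or AWS client; real services also expire keys after some window, which this sketch ignores.

```python
import uuid


class FakePaymentAPI:
    """Hypothetical stand-in for a third-party API that supports idempotency keys."""

    def __init__(self) -> None:
        self.charges_made = 0
        self._responses: dict[str, dict] = {}

    def charge(self, amount: int, idempotency_key: str) -> dict:
        # A repeated key returns the stored response instead of charging again.
        if idempotency_key in self._responses:
            return self._responses[idempotency_key]
        self.charges_made += 1
        response = {"charge_id": str(uuid.uuid4()), "amount": amount}
        self._responses[idempotency_key] = response
        return response


api = FakePaymentAPI()
# Derive the key from the event, so that a retried event reuses the same key.
key = "order-1:payment"
first = api.charge(1000, idempotency_key=key)
retry = api.charge(1000, idempotency_key=key)
print(first == retry, api.charges_made)  # True 1
```

The important design choice is that the key comes from the incoming event rather than being generated fresh on each invocation. A random key per invocation would defeat the purpose.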

In a situation where we’re not provided with an idempotent API by a 3rd party, we can use “once, and only once” processing to turn it into an idempotent API and protect the underlying API from being called multiple times, at the cost of some extra database calls and the management overhead of dealing with errors. This is described in part 2.

Idempotency in Lambda

Even without database failures, if we’re using non-idempotent APIs, we may run into issues because events delivered by AWS services such as SQS, EventBridge and Kinesis to Lambda have “at least once” delivery. This means that Lambda functions or other systems subscribed to these sources may end up receiving a message twice, sometimes within a few milliseconds of each other.

AWS has a guide on dealing with this, but at the time of writing, the guide at https://aws.amazon.com/premiumsupport/knowledge-center/lambda-function-idempotent/ provides example logic that doesn’t cover all of the possible edge cases that can result in duplicate processing.

The guide suggests the following:

  1. Extract the value of a unique attribute of the input event. (For example, a transaction or purchase ID.)
  2. Check if the attribute value exists in a control database (such as an Amazon DynamoDB table).
  3. If a unique value exists (indicating a duplicated event), gracefully terminate the execution (that is, without throwing an error). If a unique value doesn’t exist, continue the execution normally.
  4. When the function work finishes successfully, include a record in the control database.
  5. Finish the execution.

One problem with this is that it’s possible for Lambda functions to be invoked within milliseconds of each other with the same payload. Checking to see if a value exists at the start of the invocation, and then only preventing other invocations from doing the same work after the current invocation has completed the work can result in a race condition - a situation where two Lambda invocations both believe they’re the only invocation carrying out the work.
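
The race can be shown deterministically by interleaving the steps of two invocations by hand. This is a simulation with an in-memory set in place of the control database; the function names are made up for illustration.

```python
control_db: set[str] = set()
work_done: list[str] = []


def check(event_id: str) -> bool:
    """Step 2 of the guide: has this event already been recorded?"""
    return event_id in control_db


def do_work_and_record(event_id: str, invocation: str) -> None:
    """Steps 3 and 4 of the guide: do the work, then record the event."""
    work_done.append(invocation)
    control_db.add(event_id)


# Two invocations receive the same event within milliseconds of each other.
seen_by_1 = check("evt-1")  # invocation 1 checks: not found
seen_by_2 = check("evt-1")  # invocation 2 checks before 1 has recorded anything

if not seen_by_1:
    do_work_and_record("evt-1", "invocation 1")
if not seen_by_2:
    do_work_and_record("evt-1", "invocation 2")

print(work_done)  # ['invocation 1', 'invocation 2']
```

Both invocations saw an empty control database, so both did the work.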

Another problem is that it assumes something about the Lambda function - it assumes the Lambda function is only executing idempotent APIs. That is, it assumes that it’s safe to run the function if there’s no value in the control database. Let’s look at some examples of where it wouldn’t be.

Failed API call

In this example, there’s API Call A, and API Call B in the same Lambda function. Here’s what happens when there’s a failure and a message is retried.

  • Lambda invocation 1
    • Control database get: No token found
    • API Call A: Success (call 1)
    • API Call B: Error, quit with Lambda error
    • Control database write: N/A, we already quit
  • Lambda invocation 2 (the retry)
    • Control database get: No token found
    • API Call A: Success (call 2)
    • API Call B: Success
    • Control database write: Written successfully

If API Call A is not idempotent, then we may have introduced a serious problem.

Failed database write after processing

The same problem would occur if the control database write failed for some reason, even if we only had one API call in the Lambda function.

  • Lambda invocation 1
    • Control database get: No token found
    • API Call A: Success (call 1)
    • API Call B: Success (call 1)
    • Control database write: Failed to write
  • Lambda invocation 2 (the retry)
    • Control database get: No token found
    • API Call A: Success (call 2)
    • API Call B: Success (call 2)
    • Control database write: Written successfully

This case is even worse: API Call A and API Call B were both called twice, so if either of them is not idempotent, we potentially have a problem.
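
Both failure scenarios can be reproduced with a small simulation of the guide's logic. Everything here is a stand-in: the Failures helper injects one-shot errors, the control database is an in-memory set, and the retry loop approximates Lambda retrying a failed event.

```python
class Failures:
    """Injected one-shot failures, keyed by step name (hypothetical test helper)."""

    def __init__(self, *steps: str) -> None:
        self.pending = set(steps)

    def maybe_fail(self, step: str) -> None:
        if step in self.pending:
            self.pending.remove(step)
            raise RuntimeError(f"{step} failed")


def run(failures: Failures, attempts: int = 2) -> dict:
    """Run the guide's handler with retries and count the API calls made."""
    control_db: set[str] = set()
    calls = {"A": 0, "B": 0}

    def handler(token: str) -> None:
        if token in control_db:
            return  # duplicate event, exit without error
        calls["A"] += 1  # API Call A succeeds
        failures.maybe_fail("api_b")
        calls["B"] += 1  # API Call B succeeds
        failures.maybe_fail("control_db_write")
        control_db.add(token)  # only recorded after all the work is done

    for _ in range(attempts):
        try:
            handler("evt-1")
            break
        except RuntimeError:
            continue  # the event is retried
    return calls


print(run(Failures("api_b")))  # {'A': 2, 'B': 1}
print(run(Failures("control_db_write")))  # {'A': 2, 'B': 2}
```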

Next

Part 2 looks at a way to create idempotent APIs using DynamoDB.