adrianhesketh.com

Idempotency in Lambda - 2 - Dealing with it

This is part 2 of a 3 part series.

Making a Lambda function or API idempotent

The strategy here is to ensure that a protected resource (our API call) is only used once. To do this, we need to create a lock token, modelled on the railway token.

A railway token is a physical item that is still in use on some railway systems today (I saw one being used during a signal fault on my local railway a year or two ago). Drivers are only permitted to enter a section of track if they are in physical possession of a special token. This prevents multiple trains from driving on the track at once, which is very much something you don’t want to happen.

To be allowed onto a section of track, the train driver must get hold of the token; to leave the section of track, the driver must hand the token back.

We’ll do something similar - we’ll write a library that allows a section of code (ideally containing just one API call) to be protected by a lock. Clients will attempt to acquire the lock with a unique key (lockToken). If a client acquires the lock, it can enter the section of code and must store the result back to the table. If the client can’t acquire the lock, it will either get back an error or the result of the execution - it’s receiving the result of the execution that makes the code idempotent.
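To make the idea concrete before bringing in DynamoDB, here’s a minimal in-memory sketch of the token hand-over. The InMemoryLocks class and its once method are made up for illustration and aren’t part of any library. A real implementation needs a shared store so that separate processes see the same tokens.

```typescript
// In-memory stand-in for the lock table. This is an assumption made for
// illustration only; separate processes would need a shared store.
type Entry<T> = { done: boolean; result?: T };

class InMemoryLocks<T> {
  // Object.create(null) avoids clashes with built-in property names.
  private entries: { [lockToken: string]: Entry<T> } = Object.create(null);

  // Runs work at most once per lockToken. Later callers with the same
  // lockToken get the stored result instead of running the work again.
  once(lockToken: string, work: () => T): T {
    const existing = this.entries[lockToken];
    if (existing !== undefined) {
      if (!existing.done) {
        throw new Error("Work is being done by another process.");
      }
      return existing.result as T;
    }
    this.entries[lockToken] = { done: false }; // Acquire the token.
    const result = work(); // The protected section of code.
    this.entries[lockToken] = { done: true, result: result }; // Hand it back.
    return result;
  }
}
```

Note that if work throws in this sketch, the token is never handed back, so later callers are told the work is still in progress. Deciding what to do in that situation is the hard part.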

DynamoDB is a good choice for this sort of task, so we’ll start by creating a table.

Creating a lock token DynamoDB table

The DynamoDB table is structured like this:

  • lockToken
  • lockedBy
  • created
  • expectedCompletion
  • actualCompletion
  • result
  • ttl
  • status

The lockToken field stores the idempotency key and is the partition key of the table. The field value would be something like order_12345, or the cryptographic hash of a message. If a message hash is used, then it’s important to carefully analyse the fields to make sure that there isn’t a sentDate, retryCount, or other field that would cause duplicate messages to produce different hashes.
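As a rough sketch of that analysis, the function below strips out the fields that vary between deliveries before building the string that would be hashed. The volatileFields list and the canonicalise name are assumptions for this example, only top-level keys are sorted, and the hashing step itself is left out.

```typescript
// Hypothetical message shape: any object with string keys.
type Message = { [field: string]: unknown };

// Fields assumed to change between deliveries of the same logical message.
const volatileFields = ["sentDate", "retryCount"];

// Builds a stable string from the fields that identify the message.
// Top-level keys are sorted so that property order doesn't affect the output.
// The result is what you'd pass to a hash function to get the lockToken.
function canonicalise(message: Message): string {
  const stable = Object.keys(message)
    .filter((key) => volatileFields.indexOf(key) === -1)
    .sort()
    .map((key) => [key, message[key]]);
  return JSON.stringify(stable);
}
```

Two deliveries of the same order that differ only in sentDate and retryCount then produce the same string, and so the same lock token.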

The lockedBy field stores a description of the system that created the lock, for example, order_update_lambda/<lambda_invocation_id>. That makes it easier to track down any error logs.

The created field stores the date (in UTC timezone) that the lock was created at. This gives you an indication of how long the lock has been in place, and tells you where to start looking in any system logs to troubleshoot incomplete tasks.

The expectedCompletion field stores the date by which the work should have completed. If the work hasn’t completed by then, there’s definitely a problem that needs investigating.

The actualCompletion field is null when the record is first created, but is populated with the date and time when the work is completed.

The result field can be used to make sure that duplicate calls to a service return the same result. It’s null when the record is first created, and is populated with data that’s unique to the service when the work is completed. For example, if we want to create an idempotent API, we’d store the JSON-encoded API return result in this field. When a customer calls our service with a previously used key, we’d return this value. To make sure that different API consumers are unable to access the data from other customers, the lockToken field could be prefixed by the customer ID, e.g. customer_id/idempotency_key. One point to note is that DynamoDB has a limit of 400KB per record, so if the result is larger than that, you’ll need to use an alternative database.

The ttl field is set to a Unix timestamp that describes when DynamoDB’s time-to-live feature should delete the idempotency record and allow the key to be used again. If left null, then the key can never be re-used. The DynamoDB TTL feature must be enabled on the table for this to take effect.

The status field is the most complicated. It’s used to store one of 3 values - inProgress for work that is executing, error for work that has failed, and null for work that has completed. This allows a global secondary index to be placed on the table, using the status field as the partition key, and the lockToken field as the sort key. DynamoDB only includes items in a global secondary index when they have a value for the index key, so completed locks drop out of the index. Querying the index then provides a list of locks that are in progress but not completed (inProgress), and locks where the work has failed and needs human intervention (error).
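A query against that index might look something like the sketch below. It only builds the parameters, in the shape used by the DynamoDB document client, and the function name and index name are placeholders rather than anything defined in this article.

```typescript
// The two values that appear in the index; completed locks have no status.
type OpenStatus = "inProgress" | "error";

// Builds Query parameters for the status index. indexName is whatever the
// global secondary index was called when the table was created.
function buildStatusQuery(tableName: string, indexName: string, status: OpenStatus) {
  return {
    TableName: tableName,
    IndexName: indexName,
    // status is a DynamoDB reserved word, so it's referenced via an alias.
    KeyConditionExpression: "#status = :status",
    ExpressionAttributeNames: { "#status": "status" },
    ExpressionAttributeValues: { ":status": status },
  };
}
```

Calling buildStatusQuery("locks", "statusIndex", "error") gives the parameters to list the work that needs human intervention.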

If your system is going to have more than 10GB of locks in an inProgress or error state, then a more complicated setup would be required, because the maximum size of data stored under a single partition key in DynamoDB is 10GB. In that case, the system would need to use multiple values for the status field (e.g. choose a random value out of inProgress_1 or inProgress_2 to support up to 20GB of records), and make 2 queries against the global secondary index to collect the results.
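Here’s a small sketch of that sharding scheme. The function names are made up, and the random source is passed in only so that the choice can be tested.

```typescript
// Picks one of the sharded status values, e.g. inProgress_1 or inProgress_2.
// random must return a value in the range [0, 1), as Math.random does.
function pickShardedStatus(
  base: string,
  shardCount: number,
  random: () => number = Math.random
): string {
  const shard = Math.floor(random() * shardCount) + 1;
  return base + "_" + shard;
}

// Lists every sharded value, so that each one can be queried in turn
// and the results combined.
function allShardedStatuses(base: string, shardCount: number): string[] {
  const statuses: string[] = [];
  for (let i = 1; i <= shardCount; i++) {
    statuses.push(base + "_" + i);
  }
  return statuses;
}
```

Writers call pickShardedStatus when creating a lock, and the monitoring process runs one query per value returned by allShardedStatuses.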

Using the lock token table

Try and obtain a lock token

The first thing the process must do is attempt to create a record in DynamoDB, setting the lockToken, lockedBy, created, expectedCompletion, ttl and status fields to appropriate values, but leaving the actualCompletion date and the result set to null, because the work hasn’t been done yet. To ensure that only one process is acquiring the lock token, we can use a DynamoDB Condition Expression to ensure that the lockToken field is not present (attribute_not_exists(lockToken)).
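As a sketch, the parameters for that conditional write could be built like this, in the shape used by the DynamoDB document client. The function name, the NewLock type, and the retainForSeconds parameter are assumptions for this example, and it leaves actualCompletion and result off the record rather than writing explicit nulls.

```typescript
// The attributes written when a lock is first created. actualCompletion and
// result are left off until the work is done (this sketch treats a missing
// attribute as null).
interface NewLock {
  lockToken: string;
  lockedBy: string;
  created: string; // ISO 8601, UTC.
  expectedCompletion: string; // ISO 8601, UTC.
  ttl: number; // Unix time in seconds, as the DynamoDB TTL feature expects.
  status: "inProgress";
}

function buildAcquireRequest(
  tableName: string,
  lockToken: string,
  lockedBy: string,
  now: Date,
  expectedDurationMs: number,
  retainForSeconds: number
) {
  const item: NewLock = {
    lockToken: lockToken,
    lockedBy: lockedBy,
    created: now.toISOString(),
    expectedCompletion: new Date(now.getTime() + expectedDurationMs).toISOString(),
    ttl: Math.floor(now.getTime() / 1000) + retainForSeconds,
    status: "inProgress",
  };
  return {
    TableName: tableName,
    Item: item,
    // Fail the write if a record with this lockToken already exists.
    ConditionExpression: "attribute_not_exists(lockToken)",
  };
}
```

Passing the current time in, rather than reading the clock inside the function, keeps created, expectedCompletion and ttl consistent with each other.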

Once a token has been created, there’s no way for anything to attempt to do the work again. Any further attempts will fail to create the lock because it already exists in the database and the condition expression will fail, so we can be sure that only one attempt will succeed. Handily, a DynamoDB PutItem operation can also return a value, which we can use to collect the existing lock if one already exists.

The possible outcomes of this stage are:

  • Error occurred accessing the database.
  • Failed to create the lock token, because another process is currently doing the work.
  • Failed to create the lock token, because the work has already been completed by another process.
    • Here, we can decide whether to throw an error, or return the result of the previously completed work by reading the result field.
  • Successfully created the lock.
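One way to represent these outcomes in code is a discriminated union, so that callers have to handle each case. This is only a sketch, and the type and function names are made up. It also includes a failed case for records with the error status.

```typescript
// What was read back from the table when the lock could not be created.
// status is missing once the work has completed.
interface ExistingLock<T> {
  lockedBy: string;
  status?: "inProgress" | "error";
  result?: T;
}

type AcquireOutcome<T> =
  | { kind: "databaseError"; error: Error }
  | { kind: "inProgress"; lockedBy: string }
  | { kind: "failed"; lockedBy: string }
  | { kind: "alreadyComplete"; result: T | undefined }
  | { kind: "acquired" };

// dbError is set if the database couldn't be reached; existing is set if
// the conditional write failed because a record was already there.
function classifyAcquire<T>(
  dbError: Error | undefined,
  existing: ExistingLock<T> | undefined
): AcquireOutcome<T> {
  if (dbError !== undefined) {
    return { kind: "databaseError", error: dbError };
  }
  if (existing === undefined) {
    return { kind: "acquired" };
  }
  if (existing.status === "inProgress") {
    return { kind: "inProgress", lockedBy: existing.lockedBy };
  }
  if (existing.status === "error") {
    return { kind: "failed", lockedBy: existing.lockedBy };
  }
  return { kind: "alreadyComplete", result: existing.result };
}
```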

Do our work and mark the token as complete

If a lock was created successfully, the process does its work, using a try/catch to be able to take different actions depending on the type of errors found while doing the work.

If our once-only work is to call a 3rd party API, but a network fault means that we were unable to call it, then we’re probably happy to try again later, so we’d want to catch the error and completely delete the lock token to allow another process to try later.

If the API call was made successfully, but returned an error, we may be able to retry, but that depends on the API call. If we’re unable to retry safely, and we don’t want a human to take action, we should mark the lock token as complete, with an error stored in the result field. If we’re unable to retry safely but we want human intervention, then the lock token should be marked as complete as before but with the value error stored in the status field to make it easy to find failed work from the database side as well as the system logs.
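The decisions in the last two paragraphs can be summarised as a small function. The names here are hypothetical, and treating a safely retryable API error the same as a call that was never sent (by deleting the lock) is my reading of the rules rather than something they state outright.

```typescript
type Failure =
  // The API was never reached, e.g. because of a network fault.
  | { kind: "notSent" }
  // The API was called and returned an error.
  | { kind: "apiError"; safeToRetry: boolean; needsHuman: boolean };

type LockAction =
  | "deleteLock" // Allow another process to try later.
  | "completeWithErrorResult" // Complete, with the error stored in result.
  | "completeWithErrorStatus"; // Complete, with status set to error.

function decideLockAction(failure: Failure): LockAction {
  if (failure.kind === "notSent") {
    return "deleteLock";
  }
  if (failure.safeToRetry) {
    // Assumption: a safe retry is handled by releasing the lock.
    return "deleteLock";
  }
  return failure.needsHuman ? "completeWithErrorStatus" : "completeWithErrorResult";
}
```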

After successful completion of the work, the process should execute an UpdateItem to set the actualCompletion date to the current time, set the result to a useful value for replay, and set the status value to null to remove it from the list of in progress tasks.
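Here’s a sketch of the parameters for that update, again in the shape used by the DynamoDB document client. The function name is made up, and the condition expression is an extra safeguard that I’ve assumed is wanted. It stops the update from creating a fresh record if the lock has already been deleted.

```typescript
function buildCompleteRequest(
  tableName: string,
  lockToken: string,
  result: unknown,
  now: Date
) {
  return {
    TableName: tableName,
    Key: { lockToken: lockToken },
    // Set the completion details and remove status so that the record
    // drops out of the list of in-progress tasks.
    UpdateExpression:
      "SET #actualCompletion = :now, #result = :result REMOVE #status",
    // Aliases avoid clashes with DynamoDB reserved words such as status.
    ExpressionAttributeNames: {
      "#actualCompletion": "actualCompletion",
      "#result": "result",
      "#status": "status",
    },
    ExpressionAttributeValues: {
      ":now": now.toISOString(),
      ":result": JSON.stringify(result),
    },
    // Assumption: only update a lock that still exists.
    ConditionExpression: "attribute_exists(lockToken)",
  };
}
```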

Since we’re writing to a database to store the result, there’s a chance that the work completed, but the process didn’t write to the database at all, or was unable to mark the lock token as completed due to a system failure such as a timeout, network error, or unavailable database. To make sure that it’s possible to tell whether the work was actually completed or not, it’s essential to log whether the work done under a lockToken succeeded or failed before attempting to mark the lock token as complete.

Writing to logs also gives us a way to separately “sanity check” the data we have. Inability to update the lock token and set it to be complete probably isn’t worth throwing an error to consumers in most cases, but if systems rely on being able to get the same result for multiple calls with the same idempotency token, this scenario will result in those calls reporting back that another system is already doing the work, rather than returning the result.
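The ordering matters more than the details, so here’s a sketch of just that. The logger and the database update are passed in as functions, which means the names and signatures are placeholders, and everything is synchronous to keep the example short. A real version would await the database call.

```typescript
// Both functions are supplied by the caller, so this sketch makes no
// assumptions about the logging or database libraries in use.
type Log = (entry: { lockToken: string; message: string }) => void;
type MarkComplete = (lockToken: string) => void;

// Returns true if the lock token was marked as complete.
function recordCompletion(
  lockToken: string,
  log: Log,
  markComplete: MarkComplete
): boolean {
  // Log first, so there's evidence that the work succeeded even if the
  // database update fails.
  log({ lockToken: lockToken, message: "work succeeded" });
  try {
    markComplete(lockToken);
    return true;
  } catch (e) {
    // The work itself is done, so don't fail the caller. Record the
    // problem so that the stuck lock token can be found later.
    log({ lockToken: lockToken, message: "failed to mark lock token as complete" });
    return false;
  }
}
```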

Regardless of the outcome, it’s critically important to track metrics to enable alerting to take place.

Identifying problems with lock token processing

From an administrator’s perspective, we need to keep track of whether the system is working correctly, and know when to take action. There are a few things that should trigger our action:

  1. Tasks that should have completed by now, but haven’t.
  2. Tasks that did complete, but failed and require human intervention.
  3. An unusual number of inProgress lock tokens.

For item 1 in the list, we can write a process (e.g. a Lambda function that runs every 5 minutes) to poll the global secondary index on the DynamoDB table to find all of the in-progress work that should have been completed (based on the expectedCompletion date), and then to log their IDs to CloudWatch Logs and to write a metric of missingTasks that can be used to trigger alarms.

For item 2, we can write a Lambda function that queries the table for error states, but I’d just rely on a metric being created by the process itself to alert on this.
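The core of that polling process is a filter over the in-progress records. In this sketch, the InProgressLock type and findOverdue function are illustrative names, and the records are assumed to have already been read from the index.

```typescript
// The fields needed from each in-progress record.
interface InProgressLock {
  lockToken: string;
  expectedCompletion: string; // ISO 8601, UTC.
}

// Returns the lock tokens whose work should have completed before now.
function findOverdue(locks: InProgressLock[], now: Date): string[] {
  return locks
    .filter((lock) => Date.parse(lock.expectedCompletion) < now.getTime())
    .map((lock) => lock.lockToken);
}
```

The returned IDs are what gets written to the logs, and the length of the list is the value for the missingTasks metric.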

For item 3, we can create an alarm on the difference between the number of started lock tokens vs the number of completed lock tokens.

Once we’ve identified the problem, we then need to find the IDs of the affected lock tokens, which we can do by querying the logs. Once we have them, we can find out which process acquired each lock token by viewing the details in the lockedBy field, and use this information to search through its logs, starting from the date stored in the created field, to find out what happened.

If the process failed to complete its work and it’s safe to run it again, then we could remove the lock token from the database and trigger the job to run again. However, if it was safe to run it twice, we wouldn’t have bothered with all this, so this is an unlikely scenario.

So, we’re left with reading through the logs to work out what action to take, if any, and updating the lock token DynamoDB record based on the action taken (e.g. clearing the error status, marking the token as complete).

TypeScript implementation

I’ve put together an example of this pattern at: https://github.com/a-h/once

Its usage is fairly straightforward:

const apiRequest = { orderId: "12345" };

// Use locker to make the response idempotent.
const locker = new Locker<APIResult>(db.client, db.name);
const expectedDurationMs = 60000;
const lock = await locker.begin(
  "order/create/" + apiRequest.orderId,
  os.hostname(),
  expectedDurationMs
);

// If we didn't create a new lock let's return what
// we've got.
if (lock.existing != null) {
  switch (lock.existing.status) {
    case Status.Error:
      // The error is just a string, to deal with as we see fit.
      return JSON.parse(lock.existing.error);
    case Status.InProgress:
      // Something else is doing the work right now.
      throw new Error(
        "Work is being done by another process, please try again later."
      );
    case Status.Complete:
      // It's already been done once, return the cached result.
      return lock.existing.result;
  }
}

// No lock existed until now, so we need to do the work.
try {
  // Do your work here, then call endWithSuccess.
  const result = { data: "success" } as APIResult;
  await lock.endWithSuccess(result);
  return result;
} catch (e) {
  // Errors are fatal, they'll need human interaction.
  await lock.endWithError(JSON.stringify({ error: "oh no" }));
  // Rethrow so that the caller sees the failure instead of an undefined result.
  throw e;
}

Next

Part 3 looks at AWS and Stripe APIs that support idempotency.