adrianhesketh.com

Setting up an encrypted AWS Lambda dead letter queue with Go CDK

I’ve been building a system that uses EventBridge as the event bus to notify other systems about key events.

In the project, when a new item is sold, a 3rd party API must be called to register the customer for a service.

To do this, I subscribed a Lambda function to the EventBridge bus, and called their API in program code.

What happens if their API is down?

This event-driven approach decouples completing the financial transaction and calling the 3rd party API. We can still trade and take on new customers even if the 3rd party API is temporarily down.

This is a huge benefit. However, we would still need to get that 3rd party API call made when their API comes back up.

If you read through the docs, you’ll find that EventBridge automatically retries hitting targets for 24 hours.

By default, EventBridge retries sending the event for 24 hours and up to 185 times with an exponential back off and jitter, or randomized delay.

However, the target in my case is AWS Lambda, not the 3rd party API directly.

Of course, Lambda could go down, but then the whole of my application would be out of action anyway, so what matters even more in this case is how Lambda retries in the face of failure.

Lambda manages the function’s asynchronous event queue and attempts to retry on errors. If the function returns an error, Lambda attempts to run it two more times, with a one-minute wait between the first two attempts, and two minutes between the second and third attempts.

For throttling errors (429) and system errors (500-series), Lambda returns the event to the queue and attempts to run the function again for up to 6 hours. The retry interval increases exponentially from 1 second after the first attempt to a maximum of 5 minutes.

So if the Lambda function code returns an error or otherwise fails, Lambda will attempt the execution up to 3 times in total (the first attempt plus two retries) over a period of around 5 minutes.

What happens if their API is down for more than 5 minutes?

Lambda will throw the message away, and the customer will never get the service. That they paid for.

I didn’t want that to happen, so I set up a dead letter queue.

What’s a dead letter queue?

A dead-letter queue, or DLQ, is where messages that can’t be processed get sent.

AWS Lambda can be configured to send failed messages to various services, such as an SQS queue, an SNS topic or another Lambda function, so that the messages can be recovered and attempted again.

How would you know if the processing had failed?

I have an alarm set up on the AWS Lambda failed invocations metric, and have it hooked up to SNS to notify me.

I also have an alarm set up on the queue length of the SQS queue, so that I can tell whether something has been added to the queue.

This alarm is more important, because although the Lambda failed invocations metric tells me that something is failing, those messages may still get retried automatically, whereas the ones that make it to the SQS queue may require human intervention (it’s also possible to configure retry behaviour on the SQS queue, although typically, I don’t do this, since the failure rate is so low).

How do you do it?

CDK provides a construct in the github.com/aws/aws-cdk-go/awscdklambdagoalpha/v2 package.

If you set DeadLetterQueueEnabled to true, you magically get a dead letter queue.

onEventHandler := awslambdago.NewGoFunction(stack, jsii.String("OnEventHandler"), &awslambdago.GoFunctionProps{
	MemorySize: jsii.Number(1024),
	Timeout:    awscdk.Duration_Seconds(jsii.Number(60)),
	Entry:      jsii.String("./onevent"),
	Bundling:   bundlingOptions,
	Runtime:    awslambda.Runtime_GO_1_X(),
	// Dead letter handling configuration.
	RetryAttempts:          jsii.Number(2),
	DeadLetterQueueEnabled: jsii.Bool(true),
})

Unfortunately, it’s not usable for me as-is, because it’s not encrypted at rest, and most of my code runs against customer data which must be encrypted at rest.

Since the default queue doesn’t encrypt its contents, it will show up in AWS Security Hub as a medium severity finding.

You can verify it yourself by running cdk synth and looking at the generated CloudFormation template.

  OnEventHandlerDeadLetterQueueA5165245:
    Type: AWS::SQS::Queue
    Properties:
      MessageRetentionPeriod: 1209600
    UpdateReplacePolicy: Delete
    DeletionPolicy: Delete
    Metadata:
      aws:cdk:path: AWSGoCDKDeadletterStack/OnEventHandler/DeadLetterQueue/Resource
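If you'd rather not check the template by eye, the same check can be sketched in Go. This assumes you're reading the JSON version of the template that cdk synth writes into the cdk.out directory, and it treats a queue as unencrypted simply because it has no KmsMasterKeyId property, which is a simplification. The queuesWithoutKMSKey function and the second queue in the sample are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// cfnTemplate models only the parts of a CloudFormation template needed here.
type cfnTemplate struct {
	Resources map[string]struct {
		Type       string                 `json:"Type"`
		Properties map[string]interface{} `json:"Properties"`
	} `json:"Resources"`
}

// queuesWithoutKMSKey returns the sorted logical IDs of AWS::SQS::Queue
// resources that have no KmsMasterKeyId property.
// Assumption: that property is the only sign of encryption checked for.
func queuesWithoutKMSKey(templateJSON []byte) ([]string, error) {
	var t cfnTemplate
	if err := json.Unmarshal(templateJSON, &t); err != nil {
		return nil, fmt.Errorf("parsing template: %w", err)
	}
	var ids []string
	for id, r := range t.Resources {
		if r.Type != "AWS::SQS::Queue" {
			continue
		}
		if _, ok := r.Properties["KmsMasterKeyId"]; !ok {
			ids = append(ids, id)
		}
	}
	sort.Strings(ids)
	return ids, nil
}

func main() {
	// Cut-down sample: the first queue mirrors the one shown above, the
	// second is a hypothetical queue with a KMS key configured.
	sample := []byte(`{
  "Resources": {
    "OnEventHandlerDeadLetterQueueA5165245": {
      "Type": "AWS::SQS::Queue",
      "Properties": {"MessageRetentionPeriod": 1209600}
    },
    "HypotheticalEncryptedQueue": {
      "Type": "AWS::SQS::Queue",
      "Properties": {"KmsMasterKeyId": "alias/aws/sqs"}
    }
  }
}`)
	ids, err := queuesWithoutKMSKey(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(ids) // prints [OnEventHandlerDeadLetterQueueA5165245]
}
```

A check like this could run in a build pipeline after cdk synth, so that an unencrypted queue fails the build instead of turning up later as a finding.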

Since late 2021, AWS SQS has supported transparent encryption at rest using SQS-SSE [1], but there’s no CloudFormation support yet [2], so CDK can’t support it yet [3].

So, for almost all systems, I need to do extra work.

In addition, there’s no alarm, and no notification to tell me that there’s an alarm.

This means that the messages will just sit there in the queue until I notice.

So, how do you really do it?

The basic requirements are to:

  • Create an SQS queue that encrypts its contents at rest
  • Create an alarm if the queue depth is 1 or more
  • Cause an alarm to trigger an SNS notification

Create a shared SNS topic to send alerts to.

This just needs to be added once to a stack, or to a global shared stack that you use everywhere.

func addAlarmSNSTopic(stack awscdk.Stack) awssns.Topic {
	alarmEncryptionKey := awskms.NewKey(stack, jsii.String("AlarmTopicKey"), &awskms.KeyProps{})
	alarmEncryptionKey.AddToResourcePolicy(awsiam.NewPolicyStatement(&awsiam.PolicyStatementProps{
		Actions: &[]*string{
			jsii.String("kms:Decrypt"),
			jsii.String("kms:GenerateDataKey"),
		},
		Effect: awsiam.Effect_ALLOW,
		Principals: &[]awsiam.IPrincipal{
			awsiam.NewServicePrincipal(jsii.String("cloudwatch.amazonaws.com"), &awsiam.ServicePrincipalOpts{}),
		},
		Resources: &[]*string{jsii.String("*")},
	}), jsii.Bool(true))
	topic := awssns.NewTopic(stack, jsii.String("AlarmTopic"), &awssns.TopicProps{
		DisplayName: jsii.String("alarmTopic"),
		MasterKey:   alarmEncryptionKey,
	})
	topic.AddToResourcePolicy(awsiam.NewPolicyStatement(&awsiam.PolicyStatementProps{
		Actions: &[]*string{jsii.String("sns:Publish")},
		Effect:  awsiam.Effect_ALLOW,
		Principals: &[]awsiam.IPrincipal{
			awsiam.NewServicePrincipal(jsii.String("cloudwatch.amazonaws.com"), &awsiam.ServicePrincipalOpts{}),
		},
		Resources: &[]*string{topic.TopicArn()},
	}))
	awscdk.NewCfnOutput(stack, jsii.String("AlarmTopicArn"), &awscdk.CfnOutputProps{
		ExportName: jsii.String("alarm-topic-arn"),
		Value:      jsii.String(*topic.TopicArn()),
	})
	awscdk.NewCfnOutput(stack, jsii.String("AlarmTopicName"), &awscdk.CfnOutputProps{
		ExportName: jsii.String("alarm-topic-name"),
		Value:      jsii.String(*topic.TopicName()),
	})
	return topic
}

Create a dead letter queue for each Lambda function

It’s very easy to add KMS_MANAGED encryption, just another line. I don’t know why it’s not the default.

onEventHandlerDLQ := awssqs.NewQueue(stack, jsii.String("EventHandlerDLQ"), &awssqs.QueueProps{
	Encryption:      awssqs.QueueEncryption_KMS_MANAGED,
	RetentionPeriod: awscdk.Duration_Days(jsii.Number(14)),
})

Add an alarm to the dead letter queue that triggers if it contains any messages

addDLQAlarm(stack, jsii.String("EventHandlerDLQAlarm"), onEventHandlerDLQ, alarmTopic)

I have a helper function for this.

func addDLQAlarm(stack awscdk.Stack, id *string, dlq awssqs.IQueue, alarmTopic awssns.ITopic) {
	alarm := awscloudwatch.NewAlarm(stack, id, &awscloudwatch.AlarmProps{
		AlarmDescription: jsii.String("Queue depth alarm for DLQ."),
		AlarmName:        id,
		Metric: dlq.Metric(jsii.String("ApproximateNumberOfMessagesVisible"), &awscloudwatch.MetricOptions{
			Statistic: jsii.String("Maximum"),
			Period:    awscdk.Duration_Minutes(jsii.Number(5)),
		}),
		EvaluationPeriods:  jsii.Number(1),
		DatapointsToAlarm:  jsii.Number(1),
		ComparisonOperator: awscloudwatch.ComparisonOperator_GREATER_THAN_OR_EQUAL_TO_THRESHOLD,
		Threshold:          jsii.Number(1),
		ActionsEnabled:     jsii.Bool(true),
		TreatMissingData:   awscloudwatch.TreatMissingData_NOT_BREACHING,
	})
	alarm.AddAlarmAction(awscloudwatchactions.NewSnsAction(alarmTopic))
}
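To show why the alarm uses the Maximum statistic with a threshold of 1, here's a rough model of a single evaluation period in plain Go. It's a simplification of how the settings above behave, not CloudWatch's actual implementation, and the breaches function and sample values are made up for illustration.

```go
package main

import "fmt"

// breaches models one evaluation period of the alarm: it takes the Maximum of
// the samples and compares it with the threshold using >=. A period with no
// samples counts as not breaching, like TreatMissingData_NOT_BREACHING.
func breaches(samples []float64, threshold float64) bool {
	if len(samples) == 0 {
		return false
	}
	highest := samples[0]
	for _, s := range samples[1:] {
		if s > highest {
			highest = s
		}
	}
	return highest >= threshold
}

func main() {
	fmt.Println(breaches([]float64{0, 0, 0, 0, 0}, 1)) // empty queue: false
	fmt.Println(breaches([]float64{0, 0, 1, 0, 0}, 1)) // one message seen once: true
	fmt.Println(breaches(nil, 1))                      // no data reported: false
}
```

With an average instead of a maximum, the second case would come out at 0.2 and stay under the threshold, so a single message that appeared on the queue could go unnoticed.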

Configure your event handling Lambda function to use the dead letter queue

Note the use of the DeadLetterQueue parameter.

onEventHandler := awslambdago.NewGoFunction(stack, jsii.String("OnEventHandler"), &awslambdago.GoFunctionProps{
	MemorySize: jsii.Number(1024),
	Timeout:    awscdk.Duration_Seconds(jsii.Number(60)),
	Entry:      jsii.String("./onevent"),
	Bundling:   bundlingOptions,
	Runtime:    awslambda.Runtime_GO_1_X(),
	// Dead letter handling configuration.
	RetryAttempts:          jsii.Number(2),
	DeadLetterQueue:        onEventHandlerDLQ,
	DeadLetterQueueEnabled: jsii.Bool(true),
})

Results and sample code

Complete example code showing all of the alerts is available at [4].

Below are screenshots that show how it works in the console.