adrianhesketh.com

Alerting on errors being logged in CloudWatch Logs, AWS Lambda failures, and HTTP 500 errors with Go CDK

In my last post [1], I covered setting up dead letter queues to alert on message handling failures in Lambda.

In this post, I’ll cover two other areas:

  • Getting alerted when Lambda functions are logging errors.
  • Getting alerted when a Lambda function returns an error, or fails outright.

Is everything working?

While this post is focused on alerting on errors in logs, I’m also interested in logging things that aren’t errors.

Sometimes people will ask “Is everything working?”, or “Is everything alright with the system?”, or “I’ve got a complaint about order 1234, what happened?”.

I don’t want my answer to be “No errors were logged, so it’s probably OK.” I want to be able to give a timeline of what happened during a set of interactions.

To do this, it’s critical to log what happened successfully, as well as what failed in your code, so that you can answer the question with real information.

I like to see:

  • Information about what’s going to happen, and any reference data, such as the customer ID I’m sending to a service.
  • Information about what actually happened, and any reference data, such as a returned ID.
  • Warnings, if something non-critical happened.
  • Errors, if they happened.

I often find that comments in code that explain what’s happening should really be a log message instead. You only get to look at comments when you’re reading the code, but log messages are what you read when you’re under pressure to solve a production problem quickly.

How should logs be formatted?

Use structured JSON logging.

CloudWatch Logs, CloudWatch Metrics and CloudWatch Log Insights all support log filtering and extraction from JSON logs, so using JSON logging makes working with log data much simpler.

The key change is to stop trying to build up a string containing data as your log entry. You should log the name of the “event” message, along with fields containing key/value pairs of associated data.

In Go, I use Uber’s zap package [2] to do this.

package main

import (
	"go.uber.org/zap"
)

func main() {
	log, err := zap.NewProduction()
	if err != nil {
		panic("could not create logger: " + err.Error())
	}
	log.Error("fetch failed",
		zap.String("url", "https://example.com"),
		zap.Int("attempt", 3),
	)
}

The example code outputs a single line of JSON to standard output:

{"level":"error","ts":1257894000,"caller":"sandbox3700361711/prog.go:12","msg":"fetch failed","url":"https://example.com","attempt":3,"stacktrace":"..."}

How do you know if errors are being logged?

While logging errors is useful once you’ve been alerted to an issue, you also need to be alerted to the fact that the errors are happening.

Looking at the logs every day is one approach, but I work on systems that are being updated constantly with new features. Waiting until tomorrow to check the logs would result in the loss of large quantities of business.

Like other repetitive tasks, it needs to be automated. We want errors being written to the logs to result in an alert.

How do I get an alert from a log?

CloudWatch has a feature called metric filters [3] that can extract data from JSON logs and turn them into metrics.

Once CloudWatch has a metric, an alarm can be set against it, which can trigger an alert to notify you:

Log → Metric → Alarm → Alert

Setting up log alerts on Lambda functions

Using CDK, I can create a Go Lambda function, and attach everything required to alert me to any errors using the addErrorsLoggedAlarm helper function.

f := awslambdago.NewGoFunction(stack, jsii.String("Handler"), &awslambdago.GoFunctionProps{
	Runtime:    awslambda.Runtime_GO_1_X(),
	Entry:      jsii.String("./http"),
	Bundling:   bundlingOptions,
	MemorySize: jsii.Number(1024),
	Timeout:    awscdk.Duration_Millis(jsii.Number(15000)),
	Tracing:    awslambda.Tracing_ACTIVE,
	Environment: &map[string]*string{
		"AWS_XRAY_CONTEXT_MISSING": jsii.String("IGNORE_ERROR"),
	},
})
addErrorsLoggedAlarm(stack, jsii.String("HandlerErrorsLogged"), f, 1, alarmTopic)

The helper function creates the metric filter, sets an alarm on the metric, and configures the alarm to notify the SNS topic if it’s triggered.

func addErrorsLoggedAlarm(stack awscdk.Stack, id *string, f awslambdago.GoFunction, errorsIn5MinuteWindow int, alarmTopic awssns.ITopic) {
	metricNamespace := f.Stack().StackName()
	awslogs.NewMetricFilter(stack, jsii.String(*id+"_MF"), &awslogs.MetricFilterProps{
		LogGroup:        f.LogGroup(),
		MetricNamespace: metricNamespace,
		MetricName:      jsii.String("errorsLogged"),
		FilterPattern: awslogs.FilterPattern_Any(
			awslogs.FilterPattern_StringValue(jsii.String("$.level"), jsii.String("="), jsii.String("error")),
			awslogs.FilterPattern_StringValue(jsii.String("$.level"), jsii.String("="), jsii.String("ERROR")),
		),
		MetricValue: jsii.String("1"),
	})
	errorsAlarm := awscloudwatch.NewAlarm(stack, jsii.String(*id+"_Alarm"), &awscloudwatch.AlarmProps{
		AlarmDescription: jsii.String("Error logged by service."),
		AlarmName:        jsii.String(*id + "_ErrorsLoggedAlarm"),
		Metric: awscloudwatch.NewMetric(&awscloudwatch.MetricProps{
			MetricName: jsii.String("errorsLogged"),
			Namespace:  metricNamespace,
			Statistic:  jsii.String("sum"),
			Period:     awscdk.Duration_Minutes(jsii.Number(5)),
		}),
		EvaluationPeriods:  jsii.Number(1),
		DatapointsToAlarm:  jsii.Number(1),
		ComparisonOperator: awscloudwatch.ComparisonOperator_GREATER_THAN_THRESHOLD,
		Threshold:          jsii.Number(float64(errorsIn5MinuteWindow)),
		ActionsEnabled:     jsii.Bool(true),
		TreatMissingData:   awscloudwatch.TreatMissingData_NOT_BREACHING,
	})
	errorsAlarm.AddAlarmAction(awscloudwatchactions.NewSnsAction(alarmTopic))
}

What about HTTP 500 errors?

The fact that an error was logged is usually enough to work out what went wrong, but it can also be worthwhile to track the rate of HTTP 500 errors for web applications, especially if you don’t control the application code.

I’ve made a similar helper function to handle that too.

fi := awsapigatewayv2integrations.NewHttpLambdaIntegration(jsii.String("DefaultHandlerIntegration"), f, &awsapigatewayv2integrations.HttpLambdaIntegrationProps{})
endpoint := awsapigatewayv2.NewHttpApi(stack, jsii.String("ApiGatewayV2API"), &awsapigatewayv2.HttpApiProps{
	DefaultIntegration: fi,
})
// Alert on > 40% server errors within a 5 minute window.
addAPIGatewayErrorsAlarm(stack, jsii.String("HTTPApi500Errors"), endpoint, 0.4, alarmTopic)

The helper function is simpler than the metric filter version, because API Gateway provides a built-in ServerError metric to use.

The endpoint.MetricServerError function call in the helper function gets that particular metric.

func addAPIGatewayErrorsAlarm(stack awscdk.Stack, id *string, endpoint awsapigatewayv2.HttpApi, ratePercentage float64, alarmTopic awssns.ITopic) {
	errorsAlarm := awscloudwatch.NewAlarm(stack, id, &awscloudwatch.AlarmProps{
		AlarmDescription: jsii.String("API Gateway 500 errors."),
		AlarmName:        id,
		Metric: endpoint.MetricServerError(&awscloudwatch.MetricOptions{
			Statistic: jsii.String("Average"),
			Period:    awscdk.Duration_Minutes(jsii.Number(5)),
		}),
		EvaluationPeriods:  jsii.Number(1),
		DatapointsToAlarm:  jsii.Number(1),
		ComparisonOperator: awscloudwatch.ComparisonOperator_GREATER_THAN_THRESHOLD,
		Threshold:          &ratePercentage,
		ActionsEnabled:     jsii.Bool(true),
		TreatMissingData:   awscloudwatch.TreatMissingData_NOT_BREACHING,
	})
	errorsAlarm.AddAlarmAction(awscloudwatchactions.NewSnsAction(alarmTopic))
}

What if the Lambda function itself fails?

For Lambda functions driven by SQS, EventBridge, Kinesis, S3 events etc., the Lambda function signature returns an error rather than an HTTP 500 response.

func Handler(ctx context.Context, event events.CloudWatchEvent) (err error) {
	log.Info("event received", zap.String("source", event.Source), zap.String("type", event.DetailType))
	return nil
}

With a handler like this, as well as the dead letter configuration from the previous post, you might want to add an error alarm to alert if the rate of errors increases.

onEventHandler := awslambdago.NewGoFunction(stack, jsii.String("OnEventHandler"), &awslambdago.GoFunctionProps{
	MemorySize: jsii.Number(1024),
	Timeout:    awscdk.Duration_Seconds(jsii.Number(60)),
	Entry:      jsii.String("./onevent"),
	Bundling:   bundlingOptions,
	Runtime:    awslambda.Runtime_GO_1_X(),
	// Dead letter handling configuration.
	RetryAttempts:          jsii.Number(2),
	DeadLetterQueue:        onEventHandlerDLQ,
	DeadLetterQueueEnabled: jsii.Bool(true),
})
// Scrape JSON logs for errors and alert if any are found.
addErrorsLoggedAlarm(stack, jsii.String("OnEventHandlerErrorsLogged"), onEventHandler, 1, alarmTopic)
// Alert if over 40% of requests within a 5 minute window throw errors.
addLambdaErrorsAlarm(stack, jsii.String("OnEventHandlerLambdaErrors"), onEventHandler, 0.4, alarmTopic)

The addLambdaErrorsAlarm helper takes a percentage rate.

func addLambdaErrorsAlarm(stack awscdk.Stack, id *string, f awslambdago.GoFunction, ratePercentage float64, alarmTopic awssns.ITopic) {
	errorsAlarm := awscloudwatch.NewAlarm(stack, id, &awscloudwatch.AlarmProps{
		AlarmDescription: jsii.String("Error logged by service."),
		AlarmName:        id,
		Metric: f.MetricErrors(&awscloudwatch.MetricOptions{
			Statistic: jsii.String("Average"),
			Period:    awscdk.Duration_Minutes(jsii.Number(5)),
		}),
		EvaluationPeriods:  jsii.Number(1),
		DatapointsToAlarm:  jsii.Number(1),
		ComparisonOperator: awscloudwatch.ComparisonOperator_GREATER_THAN_THRESHOLD,
		Threshold:          &ratePercentage,
		ActionsEnabled:     jsii.Bool(true),
		TreatMissingData:   awscloudwatch.TreatMissingData_NOT_BREACHING,
	})
	errorsAlarm.AddAlarmAction(awscloudwatchactions.NewSnsAction(alarmTopic))
}

Where do all the alarms go?

Much as logging only gets you so far, alarms are only useful if they reach someone: if you don’t know that the errors are happening, you’re not going to be able to do anything about them.

On my latest project, I’m using Atlassian OpsGenie to handle team rotas and escalations [4], since we were already using JIRA (not my choice).

OpsGenie has a mobile app, and will also email and phone us when triggered.

It’s also possible to configure escalation rules based on date/time, priority, and other attributes. I use this to make sure that only P1 / P2 issues result in alerts outside of the team’s normal working hours, and to filter out noisy alarms until we fix them properly on the infrastructure side.

OpsGenie has an SNS integration (a somewhat fancy name for receiving a webhook), so it’s pretty easy to connect into the alarm system.
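Subscribing the alarm topic to that integration can be sketched in CDK along these lines. The URL here is a placeholder; use the webhook endpoint OpsGenie generates when you create the integration:

```go
alarmTopic.AddSubscription(awssnssubscriptions.NewUrlSubscription(
	// Placeholder URL: OpsGenie provides the real endpoint, including an API key.
	jsii.String("https://api.opsgenie.com/v1/json/amazonsns?apiKey=YOUR_API_KEY"),
	&awssnssubscriptions.UrlSubscriptionProps{},
))
```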

Full example

A complete, working example is available at [5].