Exporting CloudWatch logs for analysis using AWS Athena

At Well, we’ve been building a better pharmacy [0] using Serverless technology.

https://blog.well.co.uk/were-building-a-better-pharmacy-50bc2e04cd97/ [0]

When we write services for our customers, we need to make sure that we know that it’s working, and that it’s performing well before our tell us by getting in touch with us, or worse, just walking away. Our services write out a lot of metrics and information about what they’re doing to log files.

We’re using Go as our main backend language and are really keen on structured logging, which allows our log files to be mined for information and metrics about how our services are performing.

We use logrus [1] for that task. Here’s a real example of what our logging code looks like:

https://github.com/sirupsen/logrus [1]

logger.For(pkg, "Post", claims.UserID).
    WithField("orderID", orderID).
    WithField("trackingCode", shipment.TrackingCode).
    WithField("shipmentId", shipment.ShipmentID).
    WithField("shippingLabelsCreated", 1).
    Info("shipping label created")

As you can see, it’s more like an event than a typical log entry. It’s also a flat structure, since this makes it easier to extract data later on.

If we see that a customer experienced a problem, we can drop their unique user id into our log search tool and see exactly what happened.

When log entries are written, they looks something like this:

2018-03-29T14:40:23.417Z {"fn":"Post","level":"info","msg":"order cancelled","orderID":1,"ordersCancelled":1,"pkg":"github.com/welldigital/pharmacy/order/cancel","time":"2018-03-29T14:40:23Z","v":"459"}

From the example above, it’s possible to extract useful data about how many shipping labels were created in a time period, just by adding up all of the values of shippingLabelsCreated we find in the logs.

CloudWatch has a metric extraction feature we use to do this. You can see examples of how to set this up in your own projects in the [2] project we’ve open sourced, which tracks how long Web requests (HTTP handlers) took to execute, and what the status code was.

https://github.com/welldigital/responselogger [2]

This was a really low barrier to entry, we didn’t have to do anything complex to get high quality metrics. Just dump the logs to stdout, setup extraction and we get a really good way of visualising our data. (We use Grafana [3] for that, it’s actually the only EC2 instance we run ourselves).

https://grafana.com/ [3]

We’re trying out a few analytics tools for more advanced use cases at the moment, so we wanted to export the data from CloudWatch to import into those tools. It’s possible to use the aws logs tool to download data directly to your computer, but there’s also a built in feature in the AWS Console to do this, it copies the data to S3.

Once it’s in S3, it’s pretty easy to download it locally (aws s3 cp s3://your-log-s3-bucket/ ./data --recursive) for some analysis, or leave it in the bucket and allow other tools to access it from there.

One tool we wanted to try out for broader use was the AWS Athena tool, which allows execution of queries against data that’s held in S3 buckets to see how useful it was for us, so we pointed Athena at the log data export path in S3 and saw … nothing.

It turns out that the export format of CloudWatch and Athena didn’t quite work well together and there were some problems with the data anyway.

Exported data is in .gz files, which is supported by Athena, but not all tools support this.
There’s a directory hierarcy rather than putting all the exported files in a single path.
Each JSON log line is prefixed with a date which means that tools don’t recognise the log line as JSON.
The logs contain some automatically generated log entries which aren’t JSON at all.

Here’s some examples of the automatically generated log entries:

2018-03-29T14:40:02.823Z END RequestId: 10901dca-335f-11e8-bf9e-cbeab5ebadca
2018-03-29T14:40:02.823Z REPORT RequestId: 10901dca-335f-11e8-bf9e-cbeab5ebadca Duration: 55.17 ms  Billed Duration: 100 ms     Memory Size: 512 MB Max Memory Used: 36 MB
2018-03-29T14:40:22.453Z START RequestId: 1c4c43f7-335f-11e8-afb1-9b9ac2e2a4e9 Version: $LATEST

So we needed to filter out the stuff we didn’t need and put everything in a single directory to get Athena to read it.

Using my super-power of writing data processing command-line tools quickly, I wrote a filter tool to remove the date stamps and non-JSON lines. We’ve open-sourced it at [4].

https://github.com/welldigital/cw [4]

Once the data was cleansed, and uploaded to another S3 bucket, Athena was able to handle the data well.

In some ways, it’s modern, but downloading data to process it locally and re-uploading it feels a bit old fashioned (because it is!).

We’re now using [5] to filter and process events as they’re created and publish them to a Kinesis stream which keeps a copy of the logs in S3 while also feeding analytics tools we’re trying out.

https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/Subscriptions.html [5]