Immutable Windows Infrastructure on AWS

Immutable infrastructure means that you don’t change infrastructure (usually servers) in place; you replace it with new infrastructure instead. This normally goes hand in hand with automating that work, and it provides a few benefits:

- You can build your operating system patching process into your deployment process, so there’s no separate process and tooling (e.g. Chef, Ansible) for it.
- If your servers have been compromised, there is a smaller window of opportunity to exploit that further, because the servers are destroyed and recreated more frequently (traditionally, servers have a lifespan of years).
- Since your servers are usually built from code, you may be able to completely remove human access (e.g. SSH or Remote Desktop (RDP)).
- Automating the creation of environments is great for disaster recovery and for creating new test environments when required.
  - I like to use Terraform for all that, but recent versions of CloudFormation look OK.

How does it work?

Auto Scaling Groups are an AWS feature which (as the name suggests) lets you configure rules for when servers are added to and removed from a pool. You can set minimums and maximums, scale depending on how much load is hitting a load balancer, or just follow a schedule.

The machines started by the Auto Scaling Group are all configured to run a PowerShell script (User Data in AWS terminology) when they launch, to get themselves ready to receive traffic. At a minimum, that means installing the software to run (usually by downloading it from an S3 bucket), configuring logging etc.

It took me a few attempts to get the PowerShell scripts right, but once they were working consistently, I was able to disable RDP access at the Security Group (firewall) level to prevent access to the boxes.

Here’s a sequence diagram of the process.

The Lambda which terminates old versions runs a Go program I wrote called the Terminator to do it: [0]

https://github.com/a-h/terminator [0]

AWS Lambda doesn’t support Go programs directly, so I used a Node.js wrapper to execute it:

The process is pretty straightforward and starts by logging in to the build server and starting a deployment.

Selecting which environment you want to push to

Choosing which version to deploy:

An AWS Lambda is then triggered by the upload of the new deployment zip files to S3. This destroys a single instance in both the Web and API tiers, which we can see on our Grafana monitoring dashboards:

The Auto Scaling Group here is set to have 3 servers, so the healthy instance count should drop from 3 to 2 as the first server is destroyed.

This behaviour can be verified in the AWS Console (EC2 Auto Scaling Group). The Auto Scaling group logs the fact that an instance was automatically terminated, and a new instance is created to replace it:

The terminate_old_instances Lambda fires on a timer every 5 minutes. It collects the version number of all active instances and, if it finds old software running on an instance, it destroys the instance (to a configured minimum). Auto Scaling then kicks in again and creates a new instance running the latest version of the software.

To get this to work, I introduced an HTTP endpoint on all of the servers called “/version” which simply returns the version number of the running application.

The Lambda events are logged in CloudWatch:

2016-10-10T09:36:04.870Z    f5d15cb1-8ecc-11e6-94bc-5188e6ae168c    stdout: live_asg_web => lowest version 1.0.266, highest version 1.0.274, 2 instances to terminate
live_asg_web => terminating 2 of 3 instances
live_asg_web => terminating instance ids [i-01eb1b24b1e5d8a3e i-0ae0fbd754e6d71d9]

The example above shows that 2 instances are running 1.0.266 and one is running 1.0.274. The old instances can therefore be terminated and replaced with new instances running 1.0.274.
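The decision behind that log output can be sketched roughly as follows. This is not the Terminator's actual code: the `Instance` type, `compareVersions` and `selectForTermination` are hypothetical names, and I'm assuming that the "configured minimum" means the number of instances that must be left running.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// Instance pairs an instance ID with the version its /version endpoint reported.
type Instance struct {
	ID      string
	Version string
}

// compareVersions compares dotted numeric versions such as "1.0.266" and
// returns -1, 0 or 1. Missing or non-numeric parts count as 0 (an assumption).
func compareVersions(a, b string) int {
	as, bs := strings.Split(a, "."), strings.Split(b, ".")
	for i := 0; i < len(as) || i < len(bs); i++ {
		var x, y int
		if i < len(as) {
			x, _ = strconv.Atoi(as[i])
		}
		if i < len(bs) {
			y, _ = strconv.Atoi(bs[i])
		}
		if x < y {
			return -1
		}
		if x > y {
			return 1
		}
	}
	return 0
}

// selectForTermination returns the IDs of instances running anything older
// than the highest version seen, oldest first, while leaving at least
// minRemaining instances running.
func selectForTermination(instances []Instance, minRemaining int) []string {
	if len(instances) == 0 {
		return nil
	}
	highest := instances[0].Version
	for _, in := range instances[1:] {
		if compareVersions(in.Version, highest) > 0 {
			highest = in.Version
		}
	}
	var old []Instance
	for _, in := range instances {
		if compareVersions(in.Version, highest) < 0 {
			old = append(old, in)
		}
	}
	sort.SliceStable(old, func(i, j int) bool {
		return compareVersions(old[i].Version, old[j].Version) < 0
	})
	limit := len(instances) - minRemaining
	if limit < 0 {
		limit = 0
	}
	if len(old) > limit {
		old = old[:limit]
	}
	ids := make([]string, 0, len(old))
	for _, in := range old {
		ids = append(ids, in.ID)
	}
	return ids
}

func main() {
	// Made-up instance IDs, with the versions from the log above.
	instances := []Instance{
		{ID: "i-aaa", Version: "1.0.266"},
		{ID: "i-bbb", Version: "1.0.266"},
		{ID: "i-ccc", Version: "1.0.274"},
	}
	fmt.Println(selectForTermination(instances, 1)) // prints [i-aaa i-bbb]
}
```

Comparing the parts numerically matters here: a plain string comparison would rank 1.0.9 above 1.0.10.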

So our monitoring shows a server coming in and another one being destroyed.

As the new servers start up, they appear in the CPU graphs with higher CPU. This is expected, as starting up a .NET application results in JIT compilation and the creation of an IIS Application Pool process.

At the end of the release cycle, we’re back to 3 active nodes.

What I learned

There were a few things to know if you’re doing this yourself:

- A new AWS Windows image is released every month, a few days after Patch Tuesday, with all of the latest updates installed. You should switch your Auto Scaling Group to the new image to pick up the latest updates.
- Windows instances are slow to start up. It takes about 15-20 minutes to start an instance, install IIS, .NET 4.6.1 etc.
- To save launch time, we build our own base image once per month with IIS and the other prerequisites installed, and launch from that. This got us down to about 5-6 minutes.
- Packaging binaries for AWS Lambda from Windows doesn’t work! You can’t (easily) set the executable (+x) filesystem attribute in the zip file. You’ll need to create the zip packages on OS X or Linux etc.

How do I get started?

I put together a Linux-based example of how to use the Terminator over at [1]

https://github.com/a-h/terminator-example [1]