Monitoring and Alerting

One of the challenges of launching a new system is supporting it. It’s our responsibility to know if the systems we provide have any problems before our customers need to tell us about them.

Web applications of any decent size are complex, and involve load balancers, multiple web server tiers, caching servers, file storage servers, message queueing systems and database systems, so we need some sort of automated monitoring to collect and report on health information.

At first, it seems like this is something that you should be able to buy a product to “do”, but most tools are focussed on red/green monitoring of servers, not on whether the application is actually working. It’s completely possible that all of the servers are sat with low CPU and plenty of disk space, while a massive database queue builds up and threatens to take weeks to clear. In addition, many of the tools are hard to set up and configure, or require extra server infrastructure to run.

What did we want to be able to do?

Capture metrics (CPU, performance of server processing jobs, database queue sizes, log4net error rates over time, average search time), store them for a long time, and be able to create graphs and reports from the data.
Alert on those metrics if they fail to meet expectations, such as a disk’s free space being expected to drop to zero within the next 7 days based on past trends, or a web application being inaccessible / not performing within its expected parameters.
Investigate performance problems and update metrics collection and checks / alerting based on the outcome.

How did we go about doing it?

Inspired by the Etsy team’s tech blog and the book “Web Operations”, we built a system which stores metrics (sent via RabbitMQ) in a database, and a check system which triggers alerts to our support team.

Capturing Metrics

We built metrics collection into our products, and created a metrics service which publishes data from Windows Performance Counters to the same RabbitMQ exchange. This has provided lots of useful data, for example, spotting servers which didn’t have enough RAM to cover peak demand times.

We can take the database offline at any point for archiving, and the monitoring messages wait in RabbitMQ until a database is available again.
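
The buffering behaviour can be sketched roughly as follows. This is not our production code: the MetricSample shape and the BufferedMetricPipeline class are hypothetical, and an in-memory queue stands in for RabbitMQ so that the sketch runs without a broker.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical message shape; the real message fields are not described in this post.
public sealed class MetricSample
{
    public MetricSample(string source, string name, DateTime timestampUtc, double value)
    {
        Source = source;
        Name = name;
        TimestampUtc = timestampUtc;
        Value = value;
    }

    public string Source { get; }
    public string Name { get; }
    public DateTime TimestampUtc { get; }
    public double Value { get; }
}

// In-memory stand-in for the RabbitMQ queue and the metrics database.
public sealed class BufferedMetricPipeline
{
    private readonly Queue<MetricSample> _queue = new Queue<MetricSample>();
    private readonly List<MetricSample> _database = new List<MetricSample>();

    public bool DatabaseOnline { get; set; } = true;
    public int Queued => _queue.Count;
    public int Stored => _database.Count;

    // Applications only ever publish to the queue; they never talk to the database.
    public void Publish(MetricSample sample) => _queue.Enqueue(sample);

    // The consumer moves messages into the database only while it is reachable.
    // Returns the number of samples moved.
    public int Drain()
    {
        if (!DatabaseOnline) return 0;

        int moved = 0;
        while (_queue.Count > 0)
        {
            _database.Add(_queue.Dequeue());
            moved++;
        }
        return moved;
    }
}
```

With DatabaseOnline set to false, published samples simply accumulate in the queue; the next Drain call after the database comes back moves them all across.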

Data can be graphed as a time series (we use Sho from Microsoft Research), with C# and Entity Framework used to collect the metrics for display. Here are some graphs created using the system:

Checks / Alerting

Rather than doing something like Nagios’s active checks, where the monitoring system connects to the server / system to gather data, we’ve taken the approach of passive checks, where each of the applications reports metrics back to the metrics system and the checks happen at the monitoring system.

This has several benefits:

Heavy monitoring queries can’t affect the subject application’s performance, since they’re only executed on the monitoring system.
The metrics are meaningful, since each development team publishes statistics which are relevant to their product’s purpose.
Metrics are stored in a single database in a consistent form, so writing checks doesn’t require extensive knowledge of, for example, the database design of the system being monitored.
The monitoring application doesn’t need network access and authorisation to access systems.

Since all the data is stored in a single database, it’s easy to set up alerts (checks) on key application and server metrics. Any deviation from the expected metrics (including no metrics being available, or the metrics database being unavailable) results in a notification to our support desk.

The checks themselves are written in C# and Entity Framework. The Average, Min, Max and other extension methods make writing these checks relatively easy (compared to SQL). Since our main programming language is C#, it’s easy for anyone in the development team to write a check.
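
To give a flavour of what such a check might look like, here is a minimal sketch. It runs LINQ over an in-memory collection rather than an Entity Framework context, and the MetricPoint type, the CheckResult enum and the "search.time.ms" metric name are all made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical row shape; the real metrics schema is not shown in this post.
public sealed class MetricPoint
{
    public MetricPoint(string name, DateTime timestampUtc, double value)
    {
        Name = name;
        TimestampUtc = timestampUtc;
        Value = value;
    }

    public string Name { get; }
    public DateTime TimestampUtc { get; }
    public double Value { get; }
}

public enum CheckResult { Ok, Failed, NoData }

public static class AverageSearchTimeCheck
{
    // Fails if the average search time over the window exceeds the limit.
    // Having no samples at all is reported separately, since missing metrics
    // are themselves a reason to notify.
    public static CheckResult Run(
        IEnumerable<MetricPoint> points, DateTime nowUtc, TimeSpan window, double maxAverageMs)
    {
        var recent = points
            .Where(p => p.Name == "search.time.ms")
            .Where(p => p.TimestampUtc >= nowUtc - window && p.TimestampUtc <= nowUtc)
            .Select(p => p.Value)
            .ToList();

        if (recent.Count == 0) return CheckResult.NoData;

        return recent.Average() <= maxAverageMs ? CheckResult.Ok : CheckResult.Failed;
    }
}
```

Against a real Entity Framework context, the same Where / Average calls would be made on the context's queryable set, so the averaging would happen in the database.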

Picking a tool like Sensu would mean developers needing to learn another syntax (Ruby) to write checks in, as well as installing the Ruby runtime on the Windows servers.

Checks, such as the URL availability checks, check that:

Samples exist within the expected range.
All of the sampled requests responded within the maximum allowed timeout period.
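
Those two conditions could be expressed along these lines. This is a sketch only: UrlSample and UrlAvailabilityCheck are hypothetical names, and "expected range" is interpreted here as a recent time window, which is an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sample shape for a URL probe.
public sealed class UrlSample
{
    public UrlSample(DateTime timestampUtc, TimeSpan responseTime)
    {
        TimestampUtc = timestampUtc;
        ResponseTime = responseTime;
    }

    public DateTime TimestampUtc { get; }
    public TimeSpan ResponseTime { get; }
}

public static class UrlAvailabilityCheck
{
    // Passes only if at least one sample falls inside the window
    // and every sample in the window responded within the timeout.
    public static bool Passes(
        IEnumerable<UrlSample> samples, DateTime nowUtc, TimeSpan window, TimeSpan timeout)
    {
        var recent = samples
            .Where(s => s.TimestampUtc >= nowUtc - window && s.TimestampUtc <= nowUtc)
            .ToList();

        return recent.Count > 0 && recent.All(s => s.ResponseTime <= timeout);
    }
}
```

Checking for the presence of samples first matters: an empty window would otherwise pass the "all responded in time" test trivially.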

If a check fails, our support team are notified via the ZenDesk API. Tickets can be escalated to out-of-office using ZenDesk triggers.

When to Notify? - Predicting Disk Space Outages

One of the major drivers for notifications is server disk space, but most monitoring systems opt for a “percentage free” threshold, e.g. you get an alert when there’s less than 10% free disk space.

On a 4 TB volume, 10% is 400 GB, which might be enough to last for a week, or only for a day. Unless you know what the usage pattern is likely to be, knowing that there’s 400 GB left is not very useful in its own right.
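
To put numbers on that, here is a tiny sketch. The DiskRunway helper and the consumption rates used below are invented for illustration.

```csharp
using System;

public static class DiskRunway
{
    // Days of headroom left at a constant consumption rate.
    // A rate of zero or less means the disk is not filling up at all.
    public static double DaysLeft(double freeGb, double consumedGbPerDay)
    {
        return consumedGbPerDay <= 0
            ? double.PositiveInfinity
            : freeGb / consumedGbPerDay;
    }
}
```

DiskRunway.DaysLeft(400, 50) gives 8 days, while DiskRunway.DaysLeft(400, 400) gives 1 day: the same "10% free" alert, with very different urgency.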

Really, what we want to know is, “how long do we have left?”, since if more storage is really required (i.e. we can’t set up a job to delete some old logs, or archive old data etc.), then it usually requires some lead time.

Since our monitoring system stores the disk usage over time, we can use linear regression to work out when the disk drive is likely to fill up and add that into the alerting mix.
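
The calculation itself doesn't need any libraries. The sketch below is an assumption-laden illustration rather than our actual check: DiskSpaceForecast is a hypothetical name, time is expressed in days relative to now (negative values are in the past), and the 7 day lead time is just a default. It fits free space against time and solves for the point where the line crosses zero.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DiskSpaceForecast
{
    // Ordinary least squares fit of y = intercept + slope * x.
    public static (double Intercept, double Slope) FitLine(
        IReadOnlyList<double> x, IReadOnlyList<double> y)
    {
        if (x.Count != y.Count || x.Count < 2)
            throw new ArgumentException("Need at least two (x, y) pairs of equal length.");

        double meanX = x.Average();
        double meanY = y.Average();
        double sxy = 0, sxx = 0;
        for (int i = 0; i < x.Count; i++)
        {
            sxy += (x[i] - meanX) * (y[i] - meanY);
            sxx += (x[i] - meanX) * (x[i] - meanX);
        }
        if (sxx == 0) throw new ArgumentException("x values must not all be equal.");

        double slope = sxy / sxx;
        return (meanY - slope * meanX, slope);
    }

    // x is in days relative to now, so the intercept is the fitted free space today.
    // Returns null when free space is flat or growing, i.e. no run-out is predicted.
    public static double? DaysUntilFull(
        IReadOnlyList<double> daysRelativeToNow, IReadOnlyList<double> freeGb)
    {
        var (intercept, slope) = FitLine(daysRelativeToNow, freeGb);
        if (slope >= 0) return null;
        return -intercept / slope;
    }

    // Alert when the predicted run-out falls inside the lead time we need.
    public static bool ShouldAlert(double? daysUntilFull, double leadTimeDays = 7)
    {
        return daysUntilFull.HasValue && daysUntilFull.Value <= leadTimeDays;
    }
}
```

Because the alert is driven by the predicted number of days rather than a percentage, a large volume that fills slowly stays quiet, while a small volume that fills quickly raises a ticket early.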

Here’s a toy graph of what that looks like:

The code to generate it is pretty straightforward. You can download the LinqPad script at:

http://share.linqpad.net/tuv86w.linq

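// LinqPad "C# Program". Fit.Line is assumed to come from Math.NET Numerics and ShoChart from Sho;
// add references to both libraries in the query properties.
void Main()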
{
	var dates = Enumerable.Range(-7, 7).Select(daysInThePast => DateTime.Now.Date.AddDays(daysInThePast));
	var values = new double[] { 1000, 800, 700, 600, 500, 400, 300 };

	CreateRegressionGraph(dates, values);
}

private void CreateRegressionGraph(IEnumerable<DateTime> dates, IEnumerable<double> values)
{
	var xdata = dates.ToList();
	var ydata = values.ToList();

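	// Find the date on which we run out of space. Here x and y are swapped: we regress
	// the date (in ticks) on the free space, so the intercept is the date at which free space is zero.
	var regression = Fit.Line(ydata.ToArray(), xdata.Select(d => (double)d.Ticks).ToArray());
	var runout = new DateTime((long)regression.Item1);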

	runout.Dump();

	// Fit free space (y) against date (x) to get the trend line for the graph.
	regression = Fit.Line(xdata.Select(d => (double)d.Ticks).ToArray(), ydata.ToArray());
	var intercept = regression.Item1;
	var slope = regression.Item2;

	// The actual data.
	var rangeA_X = xdata.ToArray();
	var rangeA_Y = ydata.ToArray();

	// The regression data.
	var rangeB_X = xdata.Concat(new DateTime[] { runout }).ToArray();

	//var rangeB_Y = ydata.Concat(new double[] { 0 }).ToArray();
	var rangeB_Y = rangeB_X.Select(d => (double)d.Ticks).Select(x => intercept + slope * x).ToArray();

	// Display a chart.
	var chart = new ShoChart()
	{
		Title = "Disk Space",
		HasLegend = true,
	};

	// Plot the actual samples, then the fitted line extended to the predicted run-out date.
	chart.AddSeries(rangeA_X, rangeA_Y);
	chart.SeriesNames[0] = "Actual Samples";
	chart.AddSeries(rangeB_X, rangeB_Y);
	chart.SeriesNames[1] = "Prediction " + runout.ToString("yyyy-MM-dd");
	chart.Dump();
}

But you do need to be careful when expanding the capacity of existing disk volumes. In this case, you should retrospectively update your metrics with the additional capacity, or the prediction may be that free disk space will increase!
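
One way to do that retrospective update is sketched below. CapacityAdjustment is a hypothetical helper and it assumes the stored metric is free space in GB: every sample taken before the expansion has the added capacity applied, so the series looks as if the volume had always been the new size.

```csharp
using System;
using System.Collections.Generic;

public static class CapacityAdjustment
{
    // Returns a copy of the series in which samples taken before the expansion
    // have the added capacity included, keeping the downward trend continuous.
    public static List<(DateTime TimestampUtc, double FreeGb)> Rebase(
        IEnumerable<(DateTime TimestampUtc, double FreeGb)> samples,
        DateTime expandedAtUtc,
        double addedGb)
    {
        var result = new List<(DateTime TimestampUtc, double FreeGb)>();
        foreach (var s in samples)
        {
            double free = s.TimestampUtc < expandedAtUtc ? s.FreeGb + addedGb : s.FreeGb;
            result.Add((s.TimestampUtc, free));
        }
        return result;
    }
}
```

For example, samples of 100 GB and 80 GB followed by a 500 GB expansion and readings of 560 GB and 540 GB become 600, 580, 560 and 540, which regresses to a sensible run-out date instead of an upward slope.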

After I had the idea to do this, a little googling showed that a Monte Carlo simulation is likely a better approach:

http://lpenz.org/articles/df0pred-2/
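
For comparison, here is a rough sketch of what a Monte Carlo version might look like. It is my own simplification and not necessarily the method described in the linked article: it resamples the observed day-to-day changes in free space at random to simulate many possible futures, and reports how many days each one took to hit zero. MonteCarloDiskForecast and its parameters are hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class MonteCarloDiskForecast
{
    // dailyFreeGb: one free-space reading per day, oldest first.
    // Returns the simulated days-until-full for each run, sorted ascending.
    // Runs that never fill within the horizon are recorded as horizonDays + 1.
    public static int[] SimulateDaysUntilFull(
        IReadOnlyList<double> dailyFreeGb, int runs, int horizonDays, Random rng)
    {
        if (dailyFreeGb.Count < 2)
            throw new ArgumentException("Need at least two daily readings.");

        // Observed day-to-day changes; negative means space was consumed.
        var deltas = new double[dailyFreeGb.Count - 1];
        for (int i = 1; i < dailyFreeGb.Count; i++)
            deltas[i - 1] = dailyFreeGb[i] - dailyFreeGb[i - 1];

        var results = new int[runs];
        for (int r = 0; r < runs; r++)
        {
            double free = dailyFreeGb[dailyFreeGb.Count - 1];
            int day = 0;
            while (free > 0 && day < horizonDays)
            {
                free += deltas[rng.Next(deltas.Length)]; // replay a random past day
                day++;
            }
            results[r] = free > 0 ? horizonDays + 1 : day;
        }

        Array.Sort(results);
        return results;
    }

    // Picks the value at fraction p (0..1) of the sorted results.
    public static int Percentile(int[] sorted, double p)
    {
        int index = (int)Math.Min(sorted.Length - 1, Math.Floor(p * sorted.Length));
        return sorted[index];
    }
}
```

The appeal over a single regression line is that you get a spread of outcomes, so an alert can be based on something like the 5th percentile (a pessimistic case) rather than one point estimate.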