Thoughts on team metrics

You probably already know from the title of this post that you’re going to read some variation of the phrase “you can’t improve what you can’t measure”, a phrase that’s often attributed to Peter Drucker, so let’s just get that out of the way first.

Much like good song lyrics, there are a few readings of the phrase you can take, based on how you feel. You could read it as mystical - you can’t improve something you can’t measure, like a feeling of love or happiness - or you could read it as a statement of scientific management - data will save the day and lead us to infinite progress.

I think many of us want the reassurance of being told that we’re doing OK. Some of us seek out feedback via conversations with friends or likes on Facebook, while others seek out the same via customer satisfaction forums and the like.

As a consultant, I work with a lot of teams, and so the question I often get asked is “How good is this/that/my/their team?”. It gets asked by all sorts of people at different levels and roles in an organisation, and is usually impossible to answer without first defining what their interpretation of “good” is, and what the context is.

Defining good

Unfortunately, “good” is complicated. It’s interpreted differently depending on what your aims are, and what you value.

When we think about product engineering efforts, good looks very different at different stages of a product. At the startup stages, we might be more concerned about trying to find product/market fit and validated learning [1].

http://theleanstartup.com/principles [1]

As an organisation finds a successful product, and starts to scale, “good” might be the ability for multiple teams to work on the product at once to get more features shipped to customers. The cost of cloud infrastructure or even the cost of the engineering team may not be a major driver compared to the cost of losing market share to competitors.

In large organisations, it may be that managing complexity is the biggest challenge, and that “good” is everything working in a similar way, or built with a particular technology, regardless of whether it’s the best solution for that task.

Ryan Singer, formerly of Basecamp, sums it up in Shape Up [2] - “The ultimate meal might be a ten course dinner. But when you’re hungry and in a hurry, a hot dog is perfect.”

https://basecamp.com/shapeup/1.2-chapter-03 [2]

Running a VMOST [3] session can help to define that vision if it’s not already clear.

https://bad.tools/library/toolkits/vmost/ [3]

Choosing metrics

Once we’ve decided what “good” means to us, and what we value, we can start to make choices about how we measure that.

A key success criterion in choosing metrics is that they allow you to take action. In Lean Analytics [4], Alistair Croll and Benjamin Yoskovitz differentiate these actionable metrics from “vanity metrics” that may give you a warm feeling, but don’t provide you with information you can act upon. For example, the number of website hits doesn’t tell you what you need to know about user behaviour to improve how you turn those hits into transactions.

https://leananalyticsbook.com/ [4]

In How to Measure Anything [5], Douglas Hubbard provides a guide that helps you avoid the trap of choosing vanity metrics.

https://www.howtomeasureanything.com/ [5]

First, there has to be a decision problem, a choice to make. In a product team, the main decision problems usually centre around deciding which improvements you’re going to make to a system or process.

Next, you can measure what you know now. For example, you might know the current percentage of people that drop out during a stage of a checkout process, how long a batch process takes, or how often a process requires costly manual intervention.

Then, the key question: how much extra value would having extra information provide? Is it worth the effort of collecting more data? How much effort is it worth spending to improve your decision making? Here, it’s a good idea to think about the scale of the value you might be able to deliver.

If you have a choice between two features, where you think the first has a good chance of a payoff of £500k to £1M and the second £50k to £100k, and you’re sure the first isn’t 10x more engineering effort than the second, maybe you should just get started on it. If you’re looking for 100% certainty in any decision you make, you can end up with “analysis paralysis” and make no decisions at all.
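As a rough sketch of that reasoning, the comparison can be written down in a few lines of Python. The payoff ranges come from the example above, but the effort figures and the names (Bet, clearly_better) are hypothetical, made up purely for illustration. The idea is that if the worst case of one option still beats the best case of the other per week of effort, more data is unlikely to change the decision.

```python
from dataclasses import dataclass


@dataclass
class Bet:
    name: str
    payoff_low: float    # pessimistic payoff estimate, in pounds
    payoff_high: float   # optimistic payoff estimate, in pounds
    effort_weeks: float  # rough engineering effort (hypothetical figure)

    def payoff_per_week(self) -> tuple[float, float]:
        """Return the (low, high) payoff per week of effort."""
        return (self.payoff_low / self.effort_weeks,
                self.payoff_high / self.effort_weeks)


def clearly_better(a: Bet, b: Bet) -> bool:
    """True if a's worst case per week still beats b's best case per week."""
    return a.payoff_per_week()[0] > b.payoff_per_week()[1]


# Payoff ranges from the example; effort figures are invented for illustration.
big = Bet("feature A", 500_000, 1_000_000, effort_weeks=12)
small = Bet("feature B", 50_000, 100_000, effort_weeks=6)

if clearly_better(big, small):
    print(f"Start on {big.name}; extra measurement is unlikely to change the choice.")
else:
    print("The ranges overlap; more information might be worth collecting.")
```

When the ranges overlap, that is the point at which it becomes worth asking how much a better estimate would cost.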

To get value for money from information collection, product teams have developed low cost methods to get valuable insight. Measuring a market size, or carrying out some lean user research, is often less expensive than building products. Building a minimum viable (or loveable) product, or faking an automated solution by carrying out backend processes manually (Wizard of Oz in the Lean Startup [6]) can help test whether customer demand is present before investing in developing an automated, scalable solution.

http://theleanstartup.com/book [6]

Finally, if the value of collecting additional measurements is there, go for it.

Sometimes, collecting the data that we want to measure is difficult. Hubbard’s book provides some inspiration in the form of various “back of a napkin” estimations, and how being in the right order-of-magnitude is often enough to get the job done. One of my favourites was the Eratosthenes story, thoroughly debunked at [7].

http://kiwihellenist.blogspot.com/2015/11/eratosthenes-and-well.html [7]

Regardless of the truthiness of the anecdotes within, Hubbard’s process isn’t designed just to avoid vanity metrics, but also to avoid wasting time collecting data that you won’t act on - it makes you take the critical step of committing to action.

Baseline

Despite needing to understand what constitutes “good” for an organisation, some metrics have more universal applicability, making them relevant across many organisations. I held an internal community of practice session at Infinity Works and some core metrics came back strongly for commercial projects:

Funnel analysis - how much drop-off do we get at each stage through the process?
Cost per acquisition - how much does it cost to make a sale (including marketing spend)?
Cost per order - how much does it cost to complete the order?
Gross / Net margin - how much profit do we make per order?
Lifetime customer value - how much do we estimate each customer will go on to spend?
NPS, Feefo score, AppStore score etc. - how do customers feel about the product?
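
To make a couple of these concrete, here is a minimal Python sketch of funnel drop-off and cost per acquisition. The stage names, visitor counts and spend are made-up numbers, and the function names are hypothetical; a real implementation would read these from your analytics and finance data.

```python
def funnel_drop_off(stage_counts: dict[str, int]) -> dict[str, float]:
    """Return the fraction of users lost on the way into each stage after the first.

    Relies on the dict preserving insertion order, so stages must be
    listed in the order users pass through them.
    """
    stages = list(stage_counts.items())
    drop_off = {}
    for (_, prev_count), (name, count) in zip(stages, stages[1:]):
        drop_off[name] = (1 - count / prev_count) if prev_count else 0.0
    return drop_off


def cost_per_acquisition(total_spend: float, sales: int) -> float:
    """Total spend (including marketing) divided by the number of sales."""
    return total_spend / sales


# Made-up numbers for a four-stage checkout.
checkout = {"basket": 1000, "delivery": 800, "payment": 600, "confirmed": 540}
drop = funnel_drop_off(checkout)
cpa = cost_per_acquisition(total_spend=27_000, sales=checkout["confirmed"])

for stage, lost in drop.items():
    print(f"{stage}: {lost:.0%} of users lost from the previous stage")
print(f"Cost per acquisition: £{cpa:.2f}")
```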

However, turning these metrics into actions isn’t straightforward. Research is often required to understand why users are dropping out of an onboarding process, or not going on to use the service that they’ve signed up for.

It’s also the case that not every project can be tied directly to external customers. Projects that enable access to data, provide insights, or back-office systems will have different core metrics that contribute to overall organisation goals.

For example, the ultimate goal of a recent Infinity Works project to deliver new data pipelines was not just to replace the existing technology with something else for the sake of technology, but to improve the granularity of reporting. The project took the customer from relying on daily reports to adjust pricing and marketing spend, to being able to report in realtime. With this additional capability, the customer was able to maximise revenue and control demand during peak periods that are responsible for an outsized proportion of yearly revenue.

Teams that play a supporting role will need to be more creative, and find out what really matters to their internal customers. Projects I’ve been a part of have looked to all sorts of indicators:

Data lag - how long does it take between a sale happening and it being visible in reports?
Model performance - how much better is the latest machine learning model against a baseline?
Cost efficiency - how much is the cloud provider bill given the volume of traffic?
Build time - how many minutes does it take from committing code, to getting a completed build?
Lighthouse score - a score that combines web performance, search engine optimisation, accessibility support and other factors.

Taking it back to Basecamp’s kitchen metaphor, defining “good” is an important step in defining what to measure, but on the engineering side, if your team isn’t getting even as much as a hot dog out of the kitchen, then it’s not delivering any value at all. In these cases, the priority is often to get things moving again before looking at the value of what’s being delivered.

A well established set of engineering metrics is found in Accelerate [8]. The metrics don’t measure whether what a team ships is valued by customers, but they’re the go-to for measuring whether a team is getting things out of the kitchen:

https://www.oreilly.com/library/view/accelerate/9781457191435/ [8]

The key indicators are:

Deployment frequency (more is better)
Lead time for changes (quicker is better)
Mean time to restore (quicker is better)
Change failure rate (lower is better)

Most of the actions to improve these metrics are based on improving and automating build and release (CI/CD) processes using off-the-shelf tools, introducing automated testing, and re-engineering the product to improve observability and reliability.
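
If you don’t have tooling that reports the four indicators for you, they can be approximated from a simple log of deployments. The sketch below is one way to do that in Python. The Deployment record and its field names are hypothetical, and measuring restore time from the moment of deployment is a simplifying assumption (a real incident may start later).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import Optional


@dataclass
class Deployment:
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it cause a production failure?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def key_metrics(deployments: list[Deployment], period_days: int) -> dict[str, float]:
    """Approximate the four indicators for a non-empty list of deployments."""
    failures = [d for d in deployments if d.failed]
    restore_hours = [
        (d.restored_at - d.deployed_at) / timedelta(hours=1)
        for d in failures
        if d.restored_at is not None
    ]
    return {
        "deploys_per_day": len(deployments) / period_days,
        "lead_time_hours": mean(
            (d.deployed_at - d.committed_at) / timedelta(hours=1)
            for d in deployments
        ),
        "change_failure_rate": len(failures) / len(deployments),
        "time_to_restore_hours": mean(restore_hours) if restore_hours else 0.0,
    }


# Two made-up deployments on the same day, one of which failed.
t0 = datetime(2024, 1, 1, 9, 0)
deployments = [
    Deployment(committed_at=t0, deployed_at=t0 + timedelta(hours=2)),
    Deployment(
        committed_at=t0,
        deployed_at=t0 + timedelta(hours=4),
        failed=True,
        restored_at=t0 + timedelta(hours=5),
    ),
]
metrics = key_metrics(deployments, period_days=1)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```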

Traps

Meaningless comparisons

Reading through the various books on delivery techniques, a common theme emerges. The context of the team and its goal matters.

It can be tempting to try and compare teams within the same organisation, or between organisations using a common set of metrics. However, in practice, there are often far too many dimensions to be able to make a meaningful comparison between teams. The teams are made up of different people, with different skill sets, trying to achieve different goals, within different constraints.

Imbalanced metrics

Avoiding another trap requires balancing what you measure. The Product Leadership [9] book highlights the importance of collecting a balanced set of metrics with a few examples of where focussing on one thing to the exclusion of everything else caused negative effects.

https://productleadershipbook.com/ [9]

When discussing this with a colleague, he called it the “Cobra effect” [10], named after an anecdotal story of India in the time of British colonialism where the government offered a reward for dead snakes. So the story goes, enterprising people started breeding cobras to maximise their reward, until finally, the government was overwhelmed with rewards and cancelled the scheme. Without the incentive, cobra breeders simply released their cobras into the wild, and so this scheme ultimately led to a situation where the population of cobras in India was greatly increased. I have no idea if that’s true, but what a story.

https://www.psychologytoday.com/gb/blog/machiavellians-gulling-the-rubes/201610/the-cobra-effect-good-intentions-perverse-outcomes [10]

To avoid these situations, the recommendation is to balance performance metrics such as revenue, Annual Recurring Revenue (ARR), and Customer Acquisition Cost (CAC) which tell you how your cash is going, with customer satisfaction metrics such as Lifetime Customer Value (LTV) and Net Promoter Score (NPS) to identify how customers feel about it.

Squeezing teams

Sometimes our push for metrics is driven by a fear of missing out or of being accused of wasting time. What happens if a developer is not “busy”? If a developer isn’t 100% filled up with tasks every minute of every day, there’s a risk that we might have wasted time that we could have spent building something. Shape Up [12] has some sage words there too:

“To overcome these worries, shift the mindset from the micro to the macro. Ask yourself: How will we feel if we ship this project after six weeks? Will we feel good about what we accomplished? When projects ship on time and everyone feels they made progress, that’s the success. It doesn’t matter what exactly happened down at the scale of hours or days along the way. It’s the outcome that matters.”

https://basecamp.com/shapeup [12]

Imbalanced metrics can often pit teams against each other. For example, it’s often impossible to truly separate engineering from product and marketing - everything has to work together. If marketing increases the number of people visiting a sign up form by offering free candy, you’ll get a stream of visitors, but how many of them actually want to buy a car? If a product team is trying to minimise drop-off rate through a sales process, they’ll take a hit on their metrics because of a decision made in marketing.

Engineering and product working with marketing to ensure that analytics allows campaigns to be measured and customer segments investigated is going to give the best results. Similarly, if marketing campaigns get people to the site, but they then drop out because the site has bugs, or is difficult to use, or the product is out of stock, then the marketing budget is wasted. It’s better to work together to achieve common goals.

Feature factory

If you’re focussing on engineering metrics to the exclusion of the product, or you’ve got your technical delivery nailed, but you haven’t matured your product capability, you might see some of the dysfunctions mentioned in John Cutler’s 12 signs you’re working in a feature factory [13] blog post.

https://cutle.fish/blog/12-signs-youre-working-in-a-feature-factory [13]

One of the ways that the obsessive prioritisation found in a feature factory environment can play out is an outsized focus on development estimates and planning without much focus on the value being delivered. When teams are held accountable for delivering a feature on time, but no customers use the feature that the team worked evenings and weekends to deliver, and there’s not even acknowledgement that the bet didn’t pay off, you can expect team burnout and low engagement.

Basecamp’s Shape Up [12] process tackles this by defining “bets” that bound the maximum amount of investment we want to make on improving an area of the product, rather than focussing on how long things will take.

https://basecamp.com/shapeup [12]

Taking it one step further, somewhat counter-intuitively, the most productive teams I’ve ever been on haven’t done a lot of up-front planning or estimation, instead using a process similar to the one described by the accidental founder of the #NoEstimates Twitter hashtag, Woody Zuill, in an informative blog post [11]. I think the reason for the productivity is simply that executing meaningful work to a high standard is fulfilling, and completely engages the team.

http://zuill.us/WoodyZuill/2012/12/10/no-estimate-programming-series-intro-post/ [11]

Dashboard gazing

The plethora of metrics available in cloud platforms, and the ease of displaying them in tools like Grafana, Datadog or CloudWatch can result in constant dashboard gazing, but a lack of action. It requires expertise on the part of the watcher to understand what normal looks like. If a CPU is running at 90%, is that bad or not?

Encouraging teams to swap from looking at dashboards to defining the actions that will be taken can get things moving again. For example, we might agree that if any step in the sales funnel loses >4% throughput compared to the previous day, then we’ll stop people working on additional features and start an investigation to understand why.
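
A rule like that is simple enough to automate rather than watch for. The Python sketch below is one possible reading of it: it treats “throughput” as each step’s conversion rate and flags a step when that rate falls by more than 4% relative to the previous day. That interpretation, the function name and the sample rates are all assumptions made for illustration.

```python
THRESHOLD = 0.04  # the 4% figure from the example rule


def steps_needing_investigation(
    yesterday: dict[str, float],
    today: dict[str, float],
    threshold: float = THRESHOLD,
) -> list[str]:
    """Return funnel steps whose conversion rate fell by more than
    `threshold`, relative to the previous day's rate."""
    flagged = []
    for step, previous in yesterday.items():
        current = today.get(step)
        if current is None or previous <= 0:
            continue  # no comparable data for this step
        relative_loss = (previous - current) / previous
        if relative_loss > threshold:
            flagged.append(step)
    return flagged


# Made-up conversion rates for each step of a sales funnel.
yesterday = {"basket": 0.80, "payment": 0.75, "confirmed": 0.90}
today = {"basket": 0.79, "payment": 0.70, "confirmed": 0.90}

flagged = steps_needing_investigation(yesterday, today)
if flagged:
    print("Pause feature work and investigate:", ", ".join(flagged))
else:
    print("No funnel step breached the threshold.")
```

The useful part is not the code, but that the threshold and the response are agreed before the numbers move.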

Lies, damn lies, and statistics

It’s also possible to lie and cheat with statistics. For example, by discounting statistics that don’t align with your worldview, or by encoding biases into data and processes. One example of this was highlighted in a Guardian article [14] and the book Invisible Women [15] by Caroline Criado-Perez, which talks about the negative effect of the choice of only having male-shaped crash test dummies. It seems to me that a more diverse technology team would have been more likely to have avoided that bias trap by adding female-shaped crash test dummies, and so saved lives.

https://www.theguardian.com/lifeandstyle/2019/feb/23/truth-world-built-for-men-car-crashes [14]

https://www.penguin.co.uk/books/111/1113605/invisible-women/9781784706289.html [15]

Whatever metric you choose, there’s bound to be someone that disagrees with it. The well-established Net Promoter Score (NPS) is “considered harmful” [16] by some folks. Research shows that a Customer Effort Score [17] is more closely correlated with repeat purchase, while Feature Fit [18] scores give you actionable insight into individual features that you may be able to retire (to save maintenance costs) or rethink (to improve).

https://articles.uie.com/net-promoter-score-considered-harmful-and-what-ux-professionals-can-do-about-it/ [16]

https://www.parlor.io/blog/measure-your-products-usability-with-customer-effort-score/ [17]

https://www.parlor.io/blog/one-question-to-measure-new-feature-success/ [18]

It’s also possible to choose a metric that appears to be correlated with another metric, and imply that one thing causes the other to happen. For example, that global warming is caused by a reduction in pirates [20]. It’s such a common mistake that the phrase “Correlation does not imply causation” has its own Wikipedia page [19].

https://en.wikipedia.org/wiki/Correlation_does_not_imply_causation [19]

https://en.wikipedia.org/wiki/Flying_Spaghetti_Monster#Pirates_and_global_warming [20]

Needless to say, if you’re trying to affect a metric that has no real influence on an outcome, you’re probably wasting your time. You need to invest in affecting metrics that have predictive power.

Continuous improvement

Once obvious improvements have been made, there are often diminishing returns. This can be disheartening, but an accumulation of tiny improvements can make the difference. This idea was popularised by the cycling coach Sir Dave Brailsford as “marginal gains” [21].

https://www.bbc.co.uk/news/magazine-34247629 [21]

But while you grind away at tiny improvements, there’s always the possibility that some external force is going to be responsible for a paradigm shift that washes away the old, so we always need to reassess our strategies as times change. I’m pretty sure that Blockbuster video had completely nailed their video and DVD return processes, but what does it matter if customers just start streaming content off the Internet to their smart TVs?

Hope

In any industry or part of life, there are people out there that claim to have found the solution. Enlightened folks that have solved it all for us. If we follow the processes that they’ve outlined in their books and training courses, then all will be well, and if it doesn’t work for us, then… we didn’t understand it or weren’t doing it right.

While trying to answer the question of what to measure, and why, I’ve come to the conclusion that it’s impossible to rely on a standard methodology to answer it all for us. We still need to do what my sports coach used to call the “hard miles”:

Defining what matters to us using techniques like VMOST and Value Stream Mapping.
Ensuring that we balance our metrics and avoid cognitive bias traps.
Ensuring that when we measure, we’re also committing to taking action.
Coming up with a way to measure the things that matter.
Deciding what we’re actually going to do.
Keeping our processes human-friendly.