Thursday, 24 October 2024

Effective Incident Response - Finalise all the Things!


The Problem

I was in my first post-consultancy gig. As well as dealing with another dose of imposter syndrome (much as when I joined Thoughtworks back in the day) and problems within my own teams, it was clear to me that the organisation wasn't dealing well with incidents. Whilst the response time to incidents seemed reasonable (although I had no real concept of what "industry standard" might mean) and the time to restore service also seemed good (with the same caveat), it was clear that the overall number of incidents wasn't really going down.

The Analysis

A long time ago, back on one of my earliest engagements for Thoughtworks, a colleague and I designed a workshop to persuade a client that they needed to invest in their build pipeline and general DevOps maturity. Our approach was to analyse how they handled incidents (of various types), and we segmented the process into Discover / Analyse / Fix / Finalise. I even came up with the acronym "DAFF" to describe this process. In that case we weren't trying to influence their incident handling; rather, we were trying to raise what we would now probably call their DevOps maturity. That earlier post can be read here.

I was told that we had a "mature" incident handling process and, on the face of it, this seemed a reasonable assertion. There was a weekly meeting to discuss incidents and schedule the washup meetings, teams seemed to be creating tasks and stories on the back of those meetings, and there was a mandate that stories arising from incidents had to be prioritised. So surely the process was really good and should produce good outcomes? Certainly that was the assumption the CTO was making.

So I then examined some of the stories on the Infrastructure team's backlog that had arisen from incidents. Again, they all seemed reasonable. Things like "this EC2 instance ran out of memory, so redeploy a bigger instance" or "we got a burst of traffic which we couldn't handle (and which may happen again), so we increased the size of the auto scaling group". All good, right? Well, not quite.

The Washup

I thought the only way I would understand things better would be to attend the washup meetings, so I started asking to be invited.

Before I dive into the specifics of the example that made me realise what wasn't happening (and recall my DAFF cycle from several years earlier), it is worth pointing out that this particular company was fairly mature (certainly for a Fintech) and was carrying a lot of legacy. In particular, the "original" implementation of the "platform", a big ugly old monolith, was gradually being strangled away by new services. It was still in existence and was conceptually a microservice (albeit a big one), with dozens of "new" services circling around it. I wasn't close to the process of "sharding off" services from the old monolith, so at this point I had no real idea of the coding standards, methods of reuse and so on.

The first washup meeting I went to (it may not have been, but it helps this narrative if we pretend it was the very first one) was about a system-wide outage caused by a (still monolithic) Postgres database being unavailable (the code monolith was being split; it was "too hard" to split the database monolith, which is a whole other subject!).

The call started well, and I was impressed by how the incident had been handled. An incident Slack channel was created, the right people were pulled in, and the person handling the incident maintained an accurate timeline of discovery, analysis and the fix, right up to the point where service was restored in a very timely fashion. Then we moved on to "what do we do next?"

We went through the root cause analysis, and the conclusion was that one of the new services accessed the database using some code that mishandled the database connection pool. Something to do with not releasing connections back to the pool: more and more connections were being opened, none were being disposed of, and of course this eventually caused the server to refuse new connections. So the group agreed that the team responsible for the service that caused the outage should take on a story to "Fix Connection Pooling...", and everybody prepared to leave the call.
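To make the failure mode concrete, here is a minimal sketch of the kind of leak being described. It's in Java/JDBC purely for illustration; the class name, table and query are invented, not the actual code. The essence is that a connection which is never closed is never returned to the pool, so under load the pool (and eventually Postgres itself) runs out of connections.

    import javax.sql.DataSource;
    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    // Illustrative only: a repository that talks to the shared Postgres
    // database through a connection pool exposed as a DataSource.
    public class AccountRepository {
        private final DataSource pool;

        public AccountRepository(DataSource pool) {
            this.pool = pool;
        }

        // The leaky pattern: the Connection is never closed, so it is never
        // handed back to the pool. Every call claims another connection until
        // the pool (and eventually the database) refuses new ones.
        public String findOwnerLeaky(long accountId) throws SQLException {
            Connection conn = pool.getConnection();
            PreparedStatement ps = conn.prepareStatement(
                    "SELECT owner FROM accounts WHERE id = ?");
            ps.setLong(1, accountId);
            ResultSet rs = ps.executeQuery();
            return rs.next() ? rs.getString("owner") : null;
            // conn.close() is never called -- the connection has leaked
        }

        // The fix: try-with-resources guarantees the connection goes back to
        // the pool even if the query throws.
        public String findOwner(long accountId) throws SQLException {
            try (Connection conn = pool.getConnection();
                 PreparedStatement ps = conn.prepareStatement(
                         "SELECT owner FROM accounts WHERE id = ?")) {
                ps.setLong(1, accountId);
                try (ResultSet rs = ps.executeQuery()) {
                    return rs.next() ? rs.getString("owner") : null;
                }
            }
        }
    }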

"Wait!", I interjected. "I think we've answered the question, 'how do we stop this service causing this outage?', but have we answered the question 'how do we ensure that other services won't cause a similar outage?'" To my recollection there were a few sighs but most in the call understood the point. So I asked if there was a common module that was used to access the Postgres monolith? No. OK, is there any guidance on how to write database connection code? No. OK, so how do we know that this issue isn't waiting to happen in other services? We don't.

The Finalisation

So that discussion led to a quick finalisation accord. That is to say, we agreed we needed a common module (or service) for accessing the database. I don't remember the exact implementation, but the epic title was something like "Standardise Access to the Platform Database". Only then could we be confident that this type of outage wouldn't happen again.
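As I say, I don't remember the exact shape of what was built, but a standardised access module tends to look something like the sketch below: one place that owns the pool configuration (sizing, timeouts, leak detection) so every consuming service gets sane behaviour by default. This uses HikariCP purely as an example, and the names and numbers are illustrative, not what the company actually shipped.

    import com.zaxxer.hikari.HikariConfig;
    import com.zaxxer.hikari.HikariDataSource;
    import javax.sql.DataSource;

    // Illustrative sketch of a shared "platform database access" module:
    // every service gets its pool from here rather than rolling its own.
    public final class PlatformDatabase {

        private PlatformDatabase() {
        }

        public static DataSource createPool(String jdbcUrl, String user, String password) {
            HikariConfig config = new HikariConfig();
            config.setJdbcUrl(jdbcUrl);
            config.setUsername(user);
            config.setPassword(password);

            // Opinionated defaults owned in ONE place, so a single badly
            // behaved service can't exhaust the shared Postgres instance.
            config.setMaximumPoolSize(10);            // cap connections per service
            config.setConnectionTimeout(5_000);       // fail fast (milliseconds)
            config.setLeakDetectionThreshold(30_000); // log connections held too long

            return new HikariDataSource(config);
        }
    }

The detail matters less than the principle: the defaults live in one module, which means somebody has to own that module. Which brings us to the corollary.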

One small corollary was that we then had a (leadership level) debate about ownership of common modules. When I arrived, the company had a model of "shared ownership". Sadly, as is often the case, shared ownership largely played out as NO ownership. It was clear to everybody that this type of core functionality needed to be owned (not shared), but it wasn't clear by which team. In the end our conclusion was that the SRE function should own this piece, along with other things considered key to the reliability of the solution.

Lessons Learnt

There are two main takeaways from this story:
  1. Every incident should have a washup that results in actions to make the system more resilient.
  2. Drill down into the root cause, AND ask one more "why" when you get there. As in, "this code here caused the problem - WHY is this code like this, and should it be HERE?" in order to better understand how to stop SIMILAR incidents happening in future. FINALISE ALL THE THINGS!
