Monday, 16 December 2024

Side Effects, Unintended Consequences or Happy Accidents?

Some Language

"Side Effect" is very much a loaded term. If I'm going to implement something as a technology leader and I talk about side effects, this inevitably starts people thinking about the (almost always deleterious) side effects of medication. It is fair to say that if a medication intended to fix a symptom, or a disease, also happened to make you less likely to have a heart attack, it would not be deemed a "side effect".

What about "unintended consequences"? Again this has negative connotations. Mostly, at least in my head space, "unintended consequences" tends to be used in conjunction with some kind of government policy. For example, we in the UK constantly hear about the unintended consequences of some change to tax policy. If the government moves to close some tax loophole, the idea is to raise more tax. In practice there are often unintended consequences, such as people relocating to a different tax jurisdiction in order to avoid the tax rise, which means that the Exchequer doesn't benefit at all from the change.

What about "happy accidents"? I'm not sure how ubiquitous The Joy of Painting, presented by Bob Ross ever was. It must have been pretty popular since, according to IMDB, it ran for 31 seasons from 1983 - 1994. In each episode the artist painted a picture, went through it step by step (making it look really easy), before signing the masterpiece at the end. I am no artist, so I can't speak for how easy or difficult it ever was the reproduce what Bob Ross did, but I loved the program. 

One of the things The Joy of Painting was famous for was Bob Ross's "happy little accidents", where he would do something unintended but would always somehow work it into the painting so that it looked like it was always meant to be there. There is nothing negative to say about The Joy of Painting, or the late, great Bob Ross, but I think I'd struggle to use a phrase involving the word "accident" in an industry where we have to show that our deliberate actions led to measurable value.

My computer setup happened by accident when it turned out my client computer (left) wouldn't let me install my preferred browser. I got another laptop stand and put my own machine next to it. The result, a "happy accident", gave me more valuable screen real estate and extra search functionality.

Secondary Benefit

So I spoke to a few people at work and asked how I should describe something that wasn't the directly intended consequence of an action but turned out to be useful. We got into a discussion about secondary and tertiary benefits. I wondered why we were talking about "tertiary" instead of just "secondary". My colleague's argument was that a secondary benefit is something that might be called out and be subject to governance, whereas a tertiary benefit is something that is either not measurable or simply not measured. The governance point made some sense in our client's context, but felt quite specific to our consultancy setting. So in my head, "Secondary Benefit" is a good phrase for a good thing (possibly planned, or at least anticipated) that also happened as a result of the work.

Conclusion

Language and labels matter: they influence how people think about things, and those perceptions matter. When delivering things, we should attempt to convey that all consequences of our actions were, if not intended, then at least anticipated.

I will be using the phrase "Secondary Benefit" to describe things that were good consequences of a deliberate thing from which we derived a "Primary Benefit".


Thursday, 24 October 2024

Effective Incident Response - Finalise all the Things!


The Problem

I was in my first post-consultancy gig. As well as dealing with another dose of imposter syndrome (much as when I joined Thoughtworks back in the day) and problems within my own teams, it was also clear to me that the organisation wasn't dealing well with incidents. Whilst the response time to incidents seemed reasonable (although I had no real concept of what "industry standard" might mean) and the time to restore service also seemed good (with the same caveat), it was clear that the overall number of incidents wasn't really going down.

The Analysis

A long time ago, back on one of my earliest engagements for Thoughtworks, a colleague and I designed a workshop to persuade a client that they needed to invest in their build pipeline and general DevOps maturity. Our approach was to analyse how they handled incidents (of various types), and we segmented the process into Discover / Analyse / Fix / Finalise. I even came up with an acronym, "DAFF", to describe this process. In that case, we weren't trying to influence their incident handling; rather, we were trying to raise what we would probably now call their DevOps maturity. That earlier post can be read here.

I was told that we had a "mature" incident handling process and, on the face of it, this seemed a reasonable assertion. There was a weekly meeting to discuss incidents and schedule the washup meetings, teams seemed to be creating tasks and stories on the back of those meetings and they were mandated to ensure that stories that arose from incidents had to be prioritised. So surely the process was really good and should get good outcomes? Certainly that was the assumption that the CTO was taking.

So I then examined some of the stories on the Infrastructure team's backlog that arose from incidents. Again, they all seemed reasonable. Things like, "this EC2 instance ran out of memory, so redeploy a bigger instance", or, "we got a burst of traffic which we couldn't handle (and which may happen again), so we increased the size of the auto scaling group". All good, right? Well, not quite.

The Washup

I thought the only way I would understand things better would be to attend the washup meetings, so I started asking if I could be invited.

Before I dive into the specifics of the example that made me realise what wasn't happening (and recall my DAFF cycle from several years earlier), it is worth pointing out that this particular company was fairly mature (certainly for a Fintech) and was carrying a lot of legacy. In particular, the "original" implementation of the "platform", a big ugly old monolith, was gradually being strangled away by new services. It was still in existence and was conceptually a microservice (albeit a big one), with dozens of "new" services circling around it. I wasn't close to the process of "sharding off" services from the old monolith, so at this point I had no real idea of coding standards, methods of reuse and so on.

The first washup meeting I went to (it may not have been, but it helps this narrative if we pretend it was the very first one) was about a system-wide outage caused by a (still monolithic) Postgres database being unavailable (the code monolith was being split; it was "too hard" to split the database monolith, but that is a whole other subject!).

The call started well. I was impressed at how the incident had been handled. An incident Slack channel was created, the right people were pulled in, the person handling the incident maintained an accurate timeline of discovery, analysis and the fix, right up to the point where service was restored in a very timely fashion. Then we moved on to "what do we do next?"

The root cause analysis was gone through, and the conclusion was that one of the new services accessed the database using some code that mishandled the database connection pool. Something to do with not releasing connections back to the pool: more and more connections were being opened, none were being disposed of, and of course this eventually caused the server to refuse new connections. So the group agreed that the team responsible for the service that caused the outage should take on a story to "Fix Connection Pooling...", and everybody prepared to leave the call.
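
I never saw the offending code, but the failure mode is a classic one. Here is a minimal sketch of both the leak and the fix, in Python using psycopg2's connection pool (the real service may well have used a different language and driver, and the table name is made up):

  import psycopg2.pool

  # A pool shared by the whole service (sizes and dsn are illustrative).
  pool = psycopg2.pool.ThreadedConnectionPool(
      minconn=1, maxconn=10, dsn="dbname=platform user=app"
  )

  # The buggy shape: if execute() raises, putconn() is never reached and the
  # connection leaks. Under sustained errors or load, the pool (and eventually
  # the server's connection limit) is exhausted and new connections are refused.
  def fetch_account_leaky(account_id):
      conn = pool.getconn()
      cur = conn.cursor()
      cur.execute("SELECT * FROM accounts WHERE id = %s", (account_id,))
      row = cur.fetchone()
      pool.putconn(conn)  # skipped whenever an exception is thrown above
      return row

  # The fix: always return the connection, success or failure.
  def fetch_account(account_id):
      conn = pool.getconn()
      try:
          with conn.cursor() as cur:
              cur.execute("SELECT * FROM accounts WHERE id = %s", (account_id,))
              return cur.fetchone()
      finally:
          pool.putconn(conn)

The nasty thing about this shape of bug is that the happy path works perfectly in testing; it is only under error conditions and sustained load that the pool, and eventually the server itself, runs out of connections.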

"Wait!", I interjected. "I think we've answered the question, 'how do we stop this service causing this outage?', but have we answered the question 'how do we ensure that other services won't cause a similar outage?'" To my recollection there were a few sighs but most in the call understood the point. So I asked if there was a common module that was used to access the Postgres monolith? No. OK, is there any guidance on how to write database connection code? No. OK, so how do we know that this issue isn't waiting to happen in other services? We don't.

The Finalisation

So that discussion led to a quick finalisation accord. That is to say, we agreed we needed a common module (or service) to access the database. I don't remember the exact implementation, but the epic title was something like "Standardise Access to the Platform Database". So then we were confident that this type of outage wouldn't happen again.
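
The shape of the fix, though, was to take pool handling out of the hands of each individual service and put it in one shared, owned place. A hypothetical sketch of what such a module could look like, again in Python (the helper name and sizes are mine, not the real epic's):

  from contextlib import contextmanager

  import psycopg2.pool

  # One pool per process, owned by this shared module rather than by each
  # service's ad-hoc database code (sizes and dsn are illustrative).
  _pool = psycopg2.pool.ThreadedConnectionPool(
      minconn=1, maxconn=10, dsn="dbname=platform user=app"
  )

  @contextmanager
  def platform_db_cursor():
      """Hand out a cursor and guarantee the connection goes back to the pool."""
      conn = _pool.getconn()
      try:
          with conn.cursor() as cur:
              yield cur
          conn.commit()
      except Exception:
          conn.rollback()
          raise
      finally:
          _pool.putconn(conn)

  # Usage from any service; the pool discipline cannot be forgotten:
  #
  #   with platform_db_cursor() as cur:
  #       cur.execute("SELECT count(*) FROM accounts")

With something like this, a service can't forget to hand a connection back: the discipline lives in one owned module rather than being re-implemented (and re-broken) in every service.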

One small corollary was that we then had a (leadership level) debate about ownership of common modules. The company, when I came in, had a model of "shared ownership". Sadly, as is often the case, shared ownership largely played out as NO ownership. It was clear to everybody that this type of core functionality needed to be owned (not shared), but it wasn't clear by which team. In the end, our conclusion was that the SRE function should own this piece, and other things considered key to the reliability of the solution.

Lessons Learnt

There are two main takeaways from this story:
  1. Every incident should have a washup that results in actions to make the system more resilient.
  2. Drill down into the root cause, AND ask one more "why" when you get there. As in, "this code here caused the problem - WHY is this code like this, and should it be HERE?" in order to better understand how to stop SIMILAR incidents happening in future. FINALISE ALL THE THINGS!
