Wednesday, 26 August 2020

Pascal's Big Bang

The Big Bang Cycle of Fear

I have written and talked in the past about the Big Bang Cycle of Fear (similar to its cousin, the Tech Debt Cycle). This is a cycle that has happened in countless places over the decades. Its end game - long cycle times, lots of failed releases, low automation and general fear around changing anything - is the state that I have been introduced to at the start of many engagements as a consultant, first for ThoughtWorks and now for Codurance.

The Big Bang Cycle of Fear (extended version) goes:

  1. There was a "bad" release that caused some production problems.
  2. The response is "we have to do more tests" but instead of automating tests, because the codebase is in such a state that test isolation is hard to achieve, we add more manual testing process and possibly more "testers".
  3. In order to be able to "test everything properly" before each release, we do a "code freeze" after a set period and then assign a set period for testing of that "release candidate".
  4. At the end of this testing period nobody is confident that there isn't a problem but the "regression packs" passed with only minor issues so it is probably OK. In any case, the product owners are screaming for the release of "my feature" to go ahead.
  5. Ostensibly to discuss the risks associated with the release and therefore whether the release should go ahead, we have a long and costly meeting with all the team leads, the product owners, some key developers, the head of engineering and possibly the CTO. In reality the purpose of this meeting is to share the blame for the failure when it comes and thus ensure that no single person is blamed and possibly fired.
  6. The release goes ahead.
  7. There is a production disaster and we go back to step one.

Understanding the Issue

The first step to solving any problem is understanding that you have a problem. I think most people involved in a BBCF would recognise that they have a problem. But would they be able to identify the probable root cause of the issue? I've recently read some books about systems thinking and one of the insights in there was that people normally look for solutions to problems at the point at which the problem manifests. I think in the case of our cycle, it isn't clear where the problem happened because it is a cycle. Thus, it might not be clear where the best place to look for a solution will be.

Of course, most organisations flounder around for months, possibly (even probably) years not understanding that they are causing their own problem. A things goes wrong in production and they blame the technology part of the organisation. Of course, they do, it is a technology problem isn't it? Sadly all too often the response of the CTO is also to not understand the issue. So instead of tackling the real cause or fixing the real problem, they decide that they have to do more testing, which manifests in some common anti-patterns.

More Manual Testing

Sometimes an organisation will hire more "testers" for their "QA team", not realising that you can't test quality in after the event. Maybe they insist on a longer code freeze to make sure that they can test everything, not understanding that by doing this, they are exponentially making it it harder to test effectively. The worst way that this gets done is when the organisation already outsources its testing (we do development in house, of course!) because it is seen as lower value, and thus they hire dozens more people at the outsourcing firm, probably several time zones distant, ensuring that any feedback loop is automatically extended by at least a working day.

More Automated Tests

A seemingly more sophisticated response is to decide that what is needed is more automated tests to cover the existing functionality. They have no people available to write these tests, or the skills needed, so they hire a "Test Automation Consultant" to write a suite of tests. This seems on the face of it to be a sensible short term expense to solve a long term problem and it may even appear at first to be effective, a classic anti pattern trait. The trouble is that these "automation test consultants" leave once they have written the tests, nobody will then maintain the tests, they will then start to fail as soon as changes are introduced. Now the company is stuck with pipelines that won't complete and which probably take hours to run because they are "end to end" tests, because that is what you asked for.

How Many Things Can Go Wrong?

A couple of years ago I was working as an adviser for a company who had undergone, and were still undergoing, this big bang cycle of fear. Instinctively it feels wrong to make a big release. The more changes there are to release, the more things can go wrong. I noticed that the testers (separate from the developers) had added tests at each new release which were by hand tests, so each successive release necessitated more testing for any subsequent release. This would have been OK, of course, if the new tests added were all automated and could be run in a reasonable amount of time. Sadly, they weren't automated and so they couldn't be run in a reasonable amount of time. So straight away they had introduced a linear scaling problem.

New Problems

The strange thing, from the perspective of the managers at this company, was that even though they had enough people to run all of these "regression packs" at each stage and thus, they thought, verify their existing functionality, they still experienced strange, unforeseen issues. "How can this be?" they asked, "when we are running all of these regression scripts?" It was clear to us that the answer was simple: over time the design of the system had degraded to the extent that there were all sorts of tight, loose and temporal couplings between the various parts of the system. Nobody had a sense for what these couplings were and nobody had any idea about what they were intended to do, let alone whether they were doing it correctly. So clearly they weren't considering enough things in their testing. They needed to consider interactions between the things.

My Mathematical Question

I went to a meeting with all of the senior managers just after another failed release in which they had released in the region of 30 different things. As usual they had gone through their ritual of blame sharing, which they called "Go / NoGo" so therefore they were safe in the knowledge that no single person in the room was going to carry any cans. In fact, the interactions between the different parts of their solution were so unpredictable that they didn't even know which "thing" had failed. They just knew the release had caused issues.

I started with a simple question:
If I release one thing, how many things can go wrong?

Clearly, the answer here is 1. So I followed up with another simple question:

If I release two things, how many things can go wrong?

At this point there was a bit of a murmur in the room but the general consensus was that two things could go wrong. "I beg to differ", I said, "Thing #1 can go wrong, Thing #2 can go wrong OR the unexpected interaction between Thing #1 and Thing #2 could cause a problem." Nobody argued with this because that was exactly the experience that they had been having.

So the next question I posed was, naturally:

If I release three things, how many things can go wrong?

Now, the maths starts getting a little complex because at this point each of three individual things can go wrong, Each or three interactions between pairs can go wrong or an interaction involving all three things can go wrong. So we now have 3 + 3 + 1 = 7 places to look for our problem. I could see the group starting to appreciate where I was going with this.

Pascal's Triangle

So naturally, my train of thought led me to think that there must be a simple formula that tells us how many things can go wrong when you release N things. At first I thought I was looking at the formula for adding the first N numbers, that is to say N(N + 1) things, or O(N2), which is bad enough, but then I realised it was even worse than that.

You may remember Pascal's Triangle from studying maths at school. My recollection of it was that it was introduced to illustrate the Binomial Theorem. I realised as I was going from 3 things to 4 things... to N things that what I was seeing was the successive rows in Pascal's triangle:


So if you start from row zero (the single 1 at the top) and number the successive rows, then row 5 contains the numbers {1, 5, 10, 10, 5, 1}. If you ignore the initial 1 (which can be regarded as representing number of cases where no things interact with no other things and thus can't be the cause of any issue) you can see that there are 5 single things that can go wrong, 10 pairs of things interacting that can go wrong, 10 triplets interacting that can go wrong, 5 sets of 4 things interacting and a single set of 5 things. So we can see that if we release 5 things, there is a potential 31 things that can cause us problems.

As well as representing the coefficients in a binomial expansion the numbers in the Nth horizontal row of Pascal's Triangle add up to N2 so given that we know the first 1 is irrelevant for this discussion, the conclusion is that if we release N things, we are causing a potential N2 - 1 issues and therefore need to test all of those things to be sure that our release will not fail. Even if we could rule out certain interactions and therefore reduce the overall universe of possible interactions, the number of things that can go wrong, and therefore could need to be tested, when we release N things is still O(N2) and thus of exponential complexity. So the scary answer for my client back in 2018 was that if you as a group agree to release 30 things simultaneously then you need to test 30- 1 = 899 things. It is easy to see that you are stretching the limits of human scalability and thus why your releases fail every time.

What is the Answer?

The answer is simple and, to a lot of people, obvious. Release one thing at a time. This means reduce your batch sizes until you are doing continuous delivery. Find a way to gain confidence around your changes so that they can be released as soon as they are "done", not weeks later. This could mean you do no new work (or reduced amounts of work) while you pay down technical debt, it could mean assigning developer effort away from new features but probably most importantly it HAS to include taking a long hard look at your Work in Progress (WIP) and aggressively reducing it. 

How can you do this and still deliver value steadily? Well, for a short period of time you can't. You have to accept that your rate of release of new features will reduce. BUT, ask yourself how often you have released new things without causing new problems and then ask yourself what is your REAL rate of return of value to the business? The calculation will be different for every system but ultimately if you don't gain confidence around your working system your throughput will eventually grind to a shuddering halt.

I think the desired end point is clear. We should all be practising continuous delivery which means tiny releases very often. Depending on the current state it could be easier or harder to get there or it may not even be possible without some kind of large scale software modernisation program. Hopefully most people will see that particular state coming and do something about it before it arrives.

Conclusion

The bigger your release the more likely it is to fail. The combinatorial mathematics shows this.

Use Pascal's Big Bang to demonstrate to stakeholders the fallacy of adding extra testing cycles.

To understand your real throughput consider releases as complete only when they have no remaining issues.

Reduce batch sizes aggressively until you are comfortable with continuous delivery.

The best form of cure is prevention! Don't suffer from boiling frog syndrome. If your levels of technical debt are increasing, slowing delivery and causing problems for releases, don't allow your company to fall into one one of the common anti-patterns that inevitably lead to the Big Bang Cycle of Fear.

No comments:

Post a comment