Alan Donohoe

January 14, 2024

But What If It Does Go Wrong?

You are launching a complex change that has ramifications across the system. 

A change so complex that it is difficult for a single person (you, leading this project), or even the whole team, to know all the dependencies and side effects. You of course follow all the best practices to make sure you are releasing into production with confidence.

  1. You've written your code to the best of your ability ✅
  2. Your code is fully covered by tests, with at least unit tests, and likely integration tests too ✅
  3. You've had your peers review every change merged into main via PR Reviews ✅
  4. With the new code live in a pre-production environment, the system-level integration test suite passes ✅
  5. You've had QA manually test it in staging and then pre-prod/UAT ✅
  6. You've asked the software engineers in charge of each service this change will affect to test their service with the change live, to ensure there are no side effects ✅
  7. You've walked a domain expert through using the new code in all user-facing web apps, in a pre-production environment, making sure they can perform all their use cases, including ones even QA may have missed ✅

At this point, you and the team are confident this is safe to launch into production. 

And eventually, it is released. 🎉 🎉 🎉


But what if it does go wrong?


Maybe there's some race condition, some side effect, something that was impossible to anticipate or detect, even with all the above testing, that appears only once it's released into production.

How do you deal with that?

Ask yourself:

"If this thing does go wrong, what does that look like?"

And then monitor for that condition in production. Before you release.

Here's an example:

Recently, I was leading a complex, multi-month, multi-disciplinary project to migrate from one data provider to another, say from: 

"data_provider_legacy" -> "data_provider_new"

This involved a complex migration process, where we wanted to silently generate data from data_provider_new while only serving back data from data_provider_legacy, so that data_provider_new's output could be analysed offline.
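To make that shape concrete, here is a minimal sketch of the pattern in Python. The function names and stubs are all hypothetical; the real provider integrations and storage were specific to our system.

```python
# Hypothetical sketch of the "silent generation" migration pattern described
# above. Provider calls and storage are stubbed out; names are illustrative.
import logging

logger = logging.getLogger(__name__)


def fetch_from_legacy(request_id: str) -> dict:
    # Placeholder for the existing data_provider_legacy integration.
    return {"data_provider": "data_provider_legacy", "request_id": request_id}


def fetch_from_new(request_id: str) -> dict:
    # Placeholder for the new data_provider_new integration.
    return {"data_provider": "data_provider_new", "request_id": request_id}


def store_for_offline_analysis(request_id: str, legacy: dict, new: dict) -> None:
    # Placeholder: persist both results so they can be compared offline.
    pass


def get_data(request_id: str) -> dict:
    # Always serve the legacy provider's data while the migration is in flight.
    legacy_data = fetch_from_legacy(request_id)

    try:
        # Silently generate the new provider's data too, but never serve it.
        new_data = fetch_from_new(request_id)
        store_for_offline_analysis(request_id, legacy_data, new_data)
    except Exception:
        # The shadow path must never break the real response.
        logger.exception("Shadow call to data_provider_new failed")

    return legacy_data
```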

I performed all the above steps to make sure we were confident that the code change was safe for a production release. And so, it was released.

However, a database query that wasn't strict/exclusive enough resulted in a race condition that was very difficult to detect, even with all of the above testing, and that only manifested in production.

It meant that in 16% of the requests for data_provider_legacy data, we were responding with data_provider_new's (untested) data. 

This meant, in these 16% of cases the system behaviour was undefined. In production. Not good.
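As a rough, hypothetical illustration of what "not strict/exclusive enough" can look like (the real query isn't shown here, the field names below are invented, and I'm assuming a MongoDB-style store to match the filter syntax later in this post): the legacy-serving path has to be exclusive about which provider's documents it may return, otherwise whichever write lands last wins the race.

```python
from pymongo.collection import Collection


def get_result_not_strict_enough(results: Collection, request_id: str):
    # Returns the most recent result for the request, from whichever provider
    # happened to write last; this is where data_provider_new documents can
    # leak out to callers expecting data_provider_legacy data.
    return results.find_one(
        {"request_id": request_id},
        sort=[("created_at", -1)],
    )


def get_result_strict(results: Collection, request_id: str):
    # Exclusive about the provider: the legacy-serving path can only ever
    # return data_provider_legacy documents, however the race resolves.
    return results.find_one(
        {"request_id": request_id, "data_provider": "data_provider_legacy"},
        sort=[("created_at", -1)],
    )
```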

If I had asked, before release, while I was working on this code (while it was all still fresh in my mind and understood as deeply as possible):

"If this thing does go wrong, what does that look like?"

Then I could have worked out how to look for that in production, set up monitoring for it before releasing, and discovered the issue the first time it occurred, when it affected < 1% of cases rather than 16%.

I could then have turned off the integration with data_provider_new, fully investigated what was causing this previously unforeseen issue, and turned the integration back on once it was fixed.

In the main story of this work, you can include a ticket that captures this: 

"Post Release Monitoring of Production: 'If this Thing Does Goes Wrong, What Does That Look Like?'" 

Typically this will be a couple of hours of work to think about and set up some sort of monitoring. There's no need to be clever here; this is just a temporary set-up that you will discard once you are confident nothing is going wrong in production.

In my case, it would have been me manually running a DB query every morning on production applications, looking for those with:

{"data_provider": "data_provider_new"}

It would take about 30 seconds every morning before my stand-up. That would have done it.
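If you'd rather script that 30-second check than type the query by hand, a minimal sketch looks like this (assuming a MongoDB-style store, since the filter above is written in that syntax; the connection string, database, and collection names are placeholders):

```python
# Minimal sketch of the morning check described above. The URI, database,
# and collection names are placeholders, not the real system.
from pymongo import MongoClient


def count_new_provider_leaks() -> int:
    client = MongoClient("mongodb://localhost:27017")  # placeholder URI
    applications = client["production"]["applications"]  # placeholder names

    # Any application served with data_provider_new during the silent phase
    # is a leak and needs investigating.
    return applications.count_documents({"data_provider": "data_provider_new"})


if __name__ == "__main__":
    leaks = count_new_provider_leaks()
    print(f"Applications served with data_provider_new: {leaks}")
```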

This way, you are: 

  • Realistic about the possibility of things going wrong when launching complex changes into even more complex systems.
  • Taking responsibility for monitoring these complex changes in production after release.
  • Rectifying any issue as soon as it is detected.

Who in your team wouldn't want you to do that?

Maybe a time- and resource-pressed PM?

You can point your PM to this blog post when they ask why you are wasting time writing up and implementing this ticket when the change has already been so thoroughly tested, and ask them:

"But what if it does go wrong?"