What if COVID-19 were a JIRA ticket? — A Lesson in Bug Prevention.
I saw a tweet the other day that said, “the scariest thing about COVID-19 is that it looks like a JIRA ticket.” That is scary!
Gets you thinking, though: What if COVID-19 were a JIRA ticket for a software bug? Can we unearth a lesson or two about how to prevent bugs from sneaking into our hallowed code repositories and affecting our users?
the scariest thing about COVID-19 is that it looks like a JIRA ticket
— Molly "putting the ‘viral’ in ‘viral’" Waggett (@mollywaggett) March 4, 2020
It’s 9:17 AM. You’ve just returned from the coffee corner, all smiles, like a big golden sunflower. Setting your mug to one side, you begin energetically logging into your machine. You are a toe-tapping, coffee-drinking, head-bobbing sunflower coder!
“Ugh! What!?” you exclaim, wilting on impact with the unexpected and urgent ticket assignment.
COVID-19: Bug. Priority high. Causes difficulty breathing and sometimes system termination.
You heave a long, blustering sigh of frustration and begin unconsciously rubbing your left shoulder. This was not on this morning’s agenda. The thought of trying to rustle up a quick-’n-dirty fix is tempting, but you can’t — you just care too much. Nope, it’s time to clear the schedule and roll up the old debugging and postmortem sleeves.
First, a little more investigation is in order. For example, what’s the deal with COVID tickets 1 through 18?
Examining each ticket in turn, you start to feel like you’re binge-watching a single episode of Murder, She Wrote on repeat. Each ticket may as well be a duplicate of the previous one. Similar symptoms, similar resolution.
After four hours and numerous trips to the coffee bar (and its business partner, WC), you’ve accumulated a few notes scribbled decoratively around the coffee ring on your notepad. They may prove useful at the next Retrospective:
- Patching symptoms without addressing the root cause?
- Are we overly tolerant of software regressions?
- Overspecified tests?
- Unhelpful error messages
- How did this make it to PROD?
- Copy-paste coding?
Let’s expand on those findings.
Find the root cause
The fact that the same or similar bugs keep resurfacing likely indicates that we’re not getting at the root cause of this bug. Rather, we’re blindly throwing patch after patch at the symptoms, hoping something will stick. But that’s dangerous — we’re probably introducing new bugs on top of the old one.
Don’t settle for the first tweak that makes the error go away. Make sure you’ve really fixed it.
A telltale sign you haven’t really fixed a bug is that you can’t explain how you fixed it. If you think you fixed a bug, revert your fix and test again. Is the bug still present?
Don’t tolerate regressions
When the same “bad penny” bug, or a certain class of bugs, keeps resurfacing sprint after sprint, you’ve got a software culture that’s overly tolerant of regressions.
Maybe the team has grown numb or callous to the big-picture consequences: “Oh well, just another SQL injection vulnerability.” And you patch and move on.
But don’t. Don’t tolerate regressions.
Automated testing a must!
To prevent regressions, the first thing you need is an automated suite of regression tests, and it needs to run on every single build. When it fails, the build fails.
When you find a bug, don’t just fix it. Write a test to make sure it’s gone. Then add that test to your regression test suite to make sure it stays gone. If someone sneezes on the Git repo, at least the bug won’t make it to production.
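Concretely, a regression test pinned to the ticket might look something like this minimal pytest-style sketch (the function under test, the ticket number, and the saturation threshold are all invented for illustration):

```python
# test_covid19_regression.py -- a made-up function plus the test that keeps its bug gone.

def breathe(oxygen_saturation: float) -> str:
    """Hypothetical function under test; the buggy version crashed below 0.90 saturation."""
    if not 0 < oxygen_saturation <= 1:
        raise ValueError(f"oxygen_saturation must be in (0, 1], got {oxygen_saturation}")
    return "stable"


def test_covid19_low_saturation_stays_stable():
    # Regression test for ticket COVID-19: once the fix lands, this keeps it fixed.
    assert breathe(oxygen_saturation=0.85) == "stable"


def test_covid19_normal_saturation_still_stable():
    # Guard the happy path too, so the fix itself doesn't become the next ticket.
    assert breathe(oxygen_saturation=0.98) == "stable"
```

Run that suite on every build; if it fails, the build fails.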
Preventing some classes of bugs requires developer training — buffer overruns and the aforementioned SQL injection vulnerabilities, for example. But even here, static code analysis tools can help, provided you actually run them, ideally as part of your CI pipeline. If someone codes up a SQL injection vulnerability, it’d be nice to break the build instead of the bank.
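To make that concrete, here’s a small sketch of the difference using Python’s built-in sqlite3 module (the table, column, and payload are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (name) VALUES ('alice')")

user_input = "alice' OR '1'='1"  # a classic injection payload

# Vulnerable: splicing untrusted input straight into the SQL string.
#   conn.execute(f"SELECT * FROM users WHERE name = '{user_input}'")
#   -> returns every row; the injection succeeds.

# Safer: let the driver bind the parameter, so the payload is treated as plain data.
rows = conn.execute(
    "SELECT * FROM users WHERE name = ?", (user_input,)
).fetchall()
print(rows)  # [] -- nobody is literally named "alice' OR '1'='1"
```

A static analyzer (or a sharp-eyed reviewer) should flag the first form long before it reaches production.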
Write quality tests
Sometimes there are automated tests in place, but they catch few bugs, if any. Often that’s because they’re overspecified, testing narrow implementation details instead of true behavior and domain logic.
Interestingly, poor quality tests often exist alongside high-quality feature code because test code is viewed as a second-class citizen and isn’t given the same level of care and attention. But test code is the code that has your back. Be nice to it.
Additionally, poor-quality tests that are always breaking for silly reasons are likely to be disabled or deleted — the “test that cried wolf.” For example, a test that breaks every time the system “coughs” likely isn’t a useful test, because a “cough” is too general a symptom. The ideal test fails for one, and only one, reliable reason.
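To illustrate, here’s a sketch built around a made-up Cart class: the first test breaks on any harmless internal refactor, while the second fails for one reason only — the total is wrong:

```python
class Cart:
    """A made-up class for illustration."""

    def __init__(self):
        self._items = []  # implementation detail: a plain list of (name, price) tuples

    def add(self, name: str, price: float) -> None:
        self._items.append((name, price))

    def total(self) -> float:
        return sum(price for _, price in self._items)


def test_cart_overspecified():
    # Brittle: pokes at the private list, so any internal refactor makes it "cough".
    cart = Cart()
    cart.add("mask", 2.50)
    assert cart._items == [("mask", 2.50)]


def test_cart_behavior():
    # Robust: checks the domain rule stakeholders actually care about.
    cart = Cart()
    cart.add("mask", 2.50)
    cart.add("sanitizer", 4.00)
    assert cart.total() == 6.50
```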
Write quality error messages
About those vague and overly general symptoms… Yeah, super unhelpful. They could be caused by just about anything. It’s like getting a “tried to insert a duplicate key in the collection” error. Unless you’ve only got one dictionary in your entire codebase, you’ve got no clue what’s throwing the error until you break out your microscope, er…, debugger, and step through the code like a grumpy spelunker.
Debugging unhelpful error messages takes time. Meanwhile, suffer the stakeholders.
When you’re writing a class, a method, even a single line of code, try to foresee what can go wrong. Throw a helpful exception message or write an explanatory log entry. A few minutes spent adding an informative error message can save your future self hours searching for the source of a bug.
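For instance, compare a bare “duplicate key” complaint with an exception that names the offender (the registry and ticket IDs here are invented for illustration):

```python
def register_ticket(registry: dict, ticket_id: str, payload: dict) -> None:
    if ticket_id in registry:
        # Vague:   raise KeyError("duplicate key")
        # Helpful: say which key, which collection, and what the code refused to do.
        raise KeyError(
            f"Ticket '{ticket_id}' already exists in the incident registry "
            f"({len(registry)} tickets); refusing to overwrite the existing entry."
        )
    registry[ticket_id] = payload


registry = {"COVID-18": {"status": "closed"}}
register_ticket(registry, "COVID-19", {"status": "open"})
# register_ticket(registry, "COVID-19", {"status": "open"})  # raises the helpful KeyError
```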
Keep it out of the wild
Let’s say the bug stemmed from a type safety issue in an untyped language, lurking unnoticed in production for weeks before pouncing on an unsuspecting passerby.
How did this bug get into the wild in the first place? How did we tolerate it for so long? Why didn’t we detect it earlier?
Two practices that could have helped here are code reviews and monitoring.
Review that code
How did this bad code make it past code review? Was a proper code review process even in place? Are overworked code reviewers just rubber stamping everything?
Take code reviews seriously. And that goes for the reviewee as much as the reviewer. Make sure your pull requests are easy to review — small, focused, understandable, and documented.
Monitor it
If an issue makes it past our hawk-eyed code reviewers, we may yet catch anomalies in the production system before they reach pandemic proportions, perhaps with a regimen of monitoring supported by a healthy dose of synthetic transactions.
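A synthetic transaction doesn’t have to be fancy. Here’s a minimal sketch using only the standard library (the URL and timeout are assumptions), meant to run from a scheduler and squawk when a critical path misbehaves:

```python
import urllib.request


def check_health(url: str = "https://example.com/health", timeout_s: float = 2.0) -> bool:
    """Exercise one critical path and report whether it looks healthy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            status = resp.status
    except OSError as exc:  # covers URLError, timeouts, connection resets
        print(f"ALERT: synthetic check against {url} failed: {exc}")
        return False
    if status != 200:
        print(f"ALERT: {url} answered with status {status}")
        return False
    return True


if __name__ == "__main__":
    check_health()  # run from cron or CI and wire the ALERTs to your paging system
```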
Careful with copy-paste coding
Sometimes an organization has adopted a culture of copy-paste coding. It seems developers are hurriedly copying and pasting unvetted code from Stack Overflow without sufficient understanding or testing.
The solution here is additional training in code hygiene and even a small dose of cynicism. Stack Overflow and friends are wonderful, but their code samples should be treated as hints to the solution rather than gold standard production code.
The aftermath
After analyzing the ticket, you realize that a true root-cause fix for the bug is likely to be extremely difficult given the resources available to your organization. At the very least, you’ll need to pull in a few more developers from your team. More likely, a fix will require a coordinated effort from the entire organization.
You document your research and findings so that you’ll be better prepared when COVID-20 rears its ugly round head. And one thing’s for sure: you’ve got a lot to discuss at the next retrospective!