If you work in a code base for long enough, you’re bound to encounter some real head-scratcher bugs. The best technique I’ve found for diagnosing, and ultimately fixing, tricky bugs is to create a reproduction.
Reproductions can be hard
Let’s talk about why.
There are a lot of variables in play when trying to repro an issue. Obviously you can’t simulate the entire universe exactly the way it was when the user encountered the problem, so it takes a bit of artistry to home in on the variables you want to emulate for your repro. In general you can start with a vaguely similar environment, and then match the real environment more tightly until:
A) You achieve the reproduction.
B) Simulating environment conditions in your repro stops giving you useful information about the nature of the bug.
Reproduction strategies
Again, this is more of an art than a science. Your system, your tech stack, and the nature of the bug will all influence how you create a repro. Here are a few flavors that I find useful.
YOLO - Repro on prod (maybe)
As long as the bug isn’t doing harm (compromising data, bringing down the system, creating more cleanup for the team) it may be fine to just try doing what the user did right on production… assuming you have a production account where it’s okay for whatever the repro action is to either succeed or fail. If you build HRIS software, you probably don’t want to repro a salary change on prod… but if you run a screenshot SaaS, it’s probably fine?
Mimic the user
Repeat the steps they took to expose the bug. But also, mimic their system variables…
use the same browser as them
make your account preferences match theirs
give yourself the same permissions as them
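One way to keep track of how closely your setup matches the user's is to write both down and diff them. The sketch below is only an illustration: `ReproContext` and `mismatches` are hypothetical names, and the three fields simply mirror the list above rather than any real system.

```python
from dataclasses import dataclass, field


@dataclass
class ReproContext:
    """Hypothetical snapshot of the variables you are trying to match."""
    browser: str
    preferences: dict = field(default_factory=dict)
    permissions: frozenset = frozenset()


def mismatches(mine, theirs):
    """Return {variable: (my_value, their_value)} for everything that still differs."""
    diff = {}
    if mine.browser != theirs.browser:
        diff["browser"] = (mine.browser, theirs.browser)
    # Compare every preference key that either side has set.
    for key in sorted(set(mine.preferences) | set(theirs.preferences)):
        a, b = mine.preferences.get(key), theirs.preferences.get(key)
        if a != b:
            diff[f"preferences.{key}"] = (a, b)
    if mine.permissions != theirs.permissions:
        diff["permissions"] = (mine.permissions, theirs.permissions)
    return diff
```

An empty result means you have matched everything you are tracking. If the bug still won't reproduce at that point, the variable that matters is probably one you haven't written down yet.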
Non-prod environments
Not all repros should be done on prod. While QA and ops teams all long for pre-prod environments that replicate production, it’s often the case that pre-prod envs are leaner or, at the very least, less trafficked.
In high-throughput systems, this can have a big impact, especially when dealing with recursion, concurrency, batching, or latency-sensitive areas of your system.
Mimic prod load
When mimicking the user’s actions and account state doesn’t produce a repro, a good next step is to look at the load conditions of the system. Load can be tied directly to the user (they have 10x the volume of records of a normal user), or it can be ephemeral (the bug happens at peak traffic time in a multi-tenant environment). In these cases, scripting and unit testing can really help. Write a script that inserts a bunch of records or hammers your service with traffic, then run your reproduction and see what happens.
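Here is a rough sketch of what such a script might look like. Everything in it is an assumption for illustration: the `records` table and its columns are made up, SQLite stands in for whatever datastore you actually use, and `action` is whatever callable performs your repro step (an HTTP request, a job enqueue, and so on).

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor


def seed_records(conn, account_id, count):
    """Bulk-insert `count` placeholder rows for one account (hypothetical schema)."""
    conn.executemany(
        "INSERT INTO records (account_id, payload) VALUES (?, ?)",
        ((account_id, f"record-{i}") for i in range(count)),
    )
    conn.commit()


def hammer(action, requests, workers):
    """Call action(i) `requests` times across `workers` threads.

    Returns a list of (i, exception) pairs for the calls that raised.
    """
    def attempt(i):
        try:
            action(i)
        except Exception as exc:  # record the failure instead of stopping the run
            return (i, exc)
        return None

    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(attempt, range(requests))
        return [r for r in results if r is not None]
```

Seed the account up to the volume the affected user has, then point `hammer` at the repro action and look at which calls failed. Threads are only one way to generate concurrency; a dedicated load-testing tool may be a better fit if you need realistic traffic shapes.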
Write a test
I’m not dogmatic when it comes to writing tests, but I find that they really shine for reproductions. Writing a test has the great side effect of expressly documenting the desired behavior. Ideally you can write the test, and see it fail while the bug still exists. This is reproduction! Once you patch the code and the test passes, you (probably) have a real fix.
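To make that concrete, here is a small, invented example. The bug is hypothetical: a batching helper that silently drops the final partial batch. The check states the desired behavior (no records lost), fails against the buggy version, and passes once the code is patched. All function names here are made up for the sketch.

```python
def split_into_batches_buggy(items, size):
    """Hypothetical buggy code: integer division drops the final partial batch."""
    return [items[i * size:(i + 1) * size] for i in range(len(items) // size)]


def split_into_batches_fixed(items, size):
    """Patched version: step through the list so the remainder is kept."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def check_no_records_dropped(split):
    """The repro as a test: every record must come back out, in order."""
    records = list(range(25))  # 25 is deliberately not a multiple of 10
    batches = split(records, 10)
    flattened = [r for batch in batches for r in batch]
    missing = sorted(set(records) - set(flattened))
    assert flattened == records, f"lost records: {missing}"
```

Running `check_no_records_dropped(split_into_batches_buggy)` raises an `AssertionError` listing records 20 through 24, which is the reproduction. The same check against `split_into_batches_fixed` passes. In a real code base you would have one function and one test, and the fix would land between the failing run and the passing one.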