
The following article was written by Alex Pinto, Aventus’ Cryptography subject matter expert and Engineering Manager for the systems team. You can find out more about Alex at the bottom of the page.
Debugging is an essential part of the development process. It is an inescapable reality that code will have errors, and sooner or later developers will have to investigate issues they cannot explain. Debugging is often described as an art, but I think it is much closer to a science.
I’ve sometimes seen developers who don’t know how to proceed when investigating issues, particularly if they’re young or at the beginning of their careers. At this stage, investigating a mysterious problem can be much more daunting than writing code in the first place. There are unknowns, no signposts to follow, no algorithms to study or apply. Every bug is basically a new endeavour, and we may feel like there isn’t a structured process we can follow.
In this document, I want to show that this is very much not the case. Debugging can be done entirely with a scientific mindset, the same approach that has been used across science for centuries to uncover the laws governing our world. Rather than an art, debugging can be a pretty accurate instance of the Scientific Method, and I would like to make that evident here.
The steps in the Scientific Method
The Debugging Process
Observe problems and ask WHY
The first step is noticing a problem: observing that something is not going as expected. Sometimes you don’t have an option, and a bug identified by someone else is dropped in your lap.
But as developers, we have a duty to monitor our products, see them running and note if they’re acting as we expect.
The most important thing at this stage is to observe things that don’t seem to conform to our expectations, and ask: why?
This is the most powerful question in science, and the essential beginning of the scientific mindset and any investigation.
Gather Data for Problem Solving
As soon as you identify a question about something that needs fixing, your most immediate task is to gather all the information you can about it.
First, note the error message. This is the main clue. If the programmers have been competent, it will at least point you in the general direction of what is wrong. It may even point you to the right place in the code.
Unfortunately, in many situations where code has several abstraction layers, this may just be a surface warning. That is why you need to collect as much additional data as you can.
- Can you reproduce the error?
- When does the error happen?
- What are the conditions of the environment when the error shows? This includes networking conditions, disk space, configuration details, etc.
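To make the last question concrete, here is a minimal Python sketch of recording an environment snapshot alongside an error report. It is only an illustration: the function name `environment_snapshot` and the configuration variable names `APP_ENV` and `LOG_LEVEL` are assumptions, and you would replace them with whatever your application actually reads.

```python
import json
import os
import platform
import shutil
import sys
from datetime import datetime, timezone


def environment_snapshot(path="."):
    """Collect a few environment facts worth recording next to an error report."""
    usage = shutil.disk_usage(path)
    return {
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "platform": platform.platform(),
        "python_version": sys.version.split()[0],
        "working_directory": os.getcwd(),
        "free_disk_bytes": usage.free,
        # Hypothetical names: list the configuration variables your application reads.
        "config": {name: os.environ.get(name) for name in ("APP_ENV", "LOG_LEVEL")},
    }


if __name__ == "__main__":
    # Print the snapshot as JSON so it can be pasted into a bug report.
    print(json.dumps(environment_snapshot(), indent=2))
```

Saving one snapshot from a failing run and one from a healthy run gives you two records to compare line by line.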
If you can change and recompile the code, add print statements for critical variables. Make all the possibly relevant information visible and write it down. If not, try to change the log level and re-run in order to get as much information as the developers have made available.
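As a rough sketch of both ideas in Python, the standard `logging` module lets you leave diagnostic output in the code and switch it on by changing the log level. The function `apply_discount`, the logger name `orders`, and the `LOG_LEVEL` environment variable are all hypothetical, chosen only for illustration.

```python
import logging
import os

# Assumption: the desired log level is supplied via a LOG_LEVEL environment variable.
LEVELS = {
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARNING": logging.WARNING,
    "ERROR": logging.ERROR,
}
level_name = os.environ.get("LOG_LEVEL", "INFO").upper()
logging.basicConfig(
    level=LEVELS.get(level_name, logging.INFO),  # fall back to INFO on unknown names
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
log = logging.getLogger("orders")  # hypothetical module name


def apply_discount(price, percent):
    """Hypothetical function under investigation."""
    # Record the inputs and the result so the state is visible in the log.
    log.debug("apply_discount called with price=%r percent=%r", price, percent)
    result = price * (1 - percent / 100)
    log.debug("apply_discount returning %r", result)
    return result
```

Running the program with `LOG_LEVEL=DEBUG` set would show the extra messages, and leaving it unset keeps the output quiet.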
Most importantly, try to make the error reproducible. I can’t stress this enough.
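One common obstacle to reproducibility is hidden randomness. The sketch below, in Python, assumes a hypothetical function `pick_winner` whose outcome depends on a random shuffle, and shows how passing in a seeded generator makes every run behave identically. The names and the sample data are illustrative only.

```python
import random


def pick_winner(entries, rng):
    """Hypothetical function whose result depends on randomness."""
    shuffled = list(entries)
    rng.shuffle(shuffled)
    return shuffled[0]


def reproduce(seed):
    """Re-run the scenario with a fixed seed so every run behaves identically."""
    rng = random.Random(seed)  # private generator, so no hidden global state
    return pick_winner(["alice", "bob", "carol", "dave"], rng)


if __name__ == "__main__":
    # The same seed yields the same result, so a failing run can be replayed.
    print(reproduce(42) == reproduce(42))
```

If you log the seed whenever the program runs, a failure seen once can be replayed later with the same value.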
Print statements are generally useful, but rather old school. In complex code, with lots of loops or deep function calls, they can also be confusing. In these cases, you may be better off using breakpoints and live debugging. But even then, make a point of copying down relevant information about the state of important variables.
Formulate a Hypothesis
When you have enough data, hopefully you’ll start having some ideas as to what can be causing the problem. At this stage, you should formulate a hypothesis that might explain it. Maybe you’ll even have several.
It is possible that some of the data you’ve collected is irrelevant. Some may be unrelated and lead you down the wrong path. It takes experience and knowledge of the code base to tell which. But at any rate, settle on a hypothesis that explains the error and all the relevant data you have collected.
Create a Plan for Testing the Hypothesis
Prepare to test the hypothesis. This depends on the actual problem. It may involve a code fix, or changes to the setup, or changes to the configuration. It may involve preventing some conditions in the environment.
In this stage, you should write down a list of actions that, according to your hypothesis, should prevent the error from occurring.
Testing your Hypothesis
You have your plan. Now it’s time to test it. Execute the changes you’ve set out to do and run the experiment again. This requires that, in step 2 (gathering data), you’ve actually found out how to make the issue reproducible. If you haven’t, you won’t be able to test the hypothesis.
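One way to run such an experiment is to encode the reproduction as an automated check. The Python sketch below assumes a hypothetical bug: an `average` function that crashed with `ZeroDivisionError` on empty input, and a hypothesis that the empty case was never handled. Both the function and the chosen fix are illustrative, not taken from a real code base.

```python
def average(values):
    """Hypothetical function under investigation, with the proposed fix applied."""
    values = list(values)
    if not values:  # the change the hypothesis calls for: handle empty input
        return 0.0
    return sum(values) / len(values)


def test_empty_input_no_longer_crashes():
    # Before the fix, this call raised ZeroDivisionError (the reproduced bug).
    assert average([]) == 0.0


def test_normal_input_is_unchanged():
    # The fix must not alter behaviour the bug never touched.
    assert average([2, 4, 6]) == 4.0


if __name__ == "__main__":
    test_empty_input_no_longer_crashes()
    test_normal_input_is_unchanged()
    print("hypothesis confirmed: the error no longer occurs")
```

Keeping the check in the test suite afterwards means the same experiment is repeated on every future change.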
This is why “random problems” are so difficult to debug. The scientific method fails here.
Errors of this type usually include timing errors, concurrency issues, and intermittent environment conditions (e.g. bad connections).
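One possible way to get some traction on such errors, offered here as an assumption rather than part of the process described in this article, is to run the failing operation many times and measure how often it fails. The Python sketch below simulates this with a hypothetical `flaky_operation` that fails about one time in ten. In real use you would substitute the actual operation and the exception type it raises.

```python
import random


def flaky_operation(rng):
    """Hypothetical stand-in for code that fails intermittently."""
    if rng.random() < 0.1:
        raise TimeoutError("simulated intermittent failure")
    return "ok"


def failure_rate(operation, runs, rng):
    """Run the operation many times and return the fraction of runs that failed."""
    failures = 0
    for _ in range(runs):
        try:
            operation(rng)
        except TimeoutError:
            failures += 1
    return failures / runs


if __name__ == "__main__":
    rate = failure_rate(flaky_operation, runs=1000, rng=random.Random(0))
    print(f"observed failure rate: {rate:.1%}")
```

A measured failure rate gives you a baseline: if a change made under your hypothesis lowers it noticeably over the same number of runs, that is evidence in favour of the hypothesis, even though no single run can be reproduced on demand.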
Analyse the Results
Once you’ve run your experiment again, what happened? Did the error still occur? Is the issue fixed, or improved?
If the bug is fixed, well done, you’ve successfully applied the scientific method to solve it!
If not, do not despair. The scientific method does not guarantee success; it is an iterative approach that rewards persistence.
Go back to step 2 and repeat. The new experiment and its results provide more data, and will enable you to create a new hypothesis. Repeat this process until the problem goes away or is fully explained.
If you haven’t managed to solve the problem, but have managed to give a full explanation of why it happens, then you’ve made significant progress. Now, it’s time for business and engineers to decide how to tackle it.
Perhaps there is a workaround. Perhaps the impact is just low. Make a risk assessment and bring the results to the table. And let planning decide what to do with the issue. With any luck, it can even be discarded.
But as for your exploration work, that one is done.
Congratulations, you’ve successfully used the Scientific Method to explain and resolve a software bug!
About Alex Pinto
Alex has a master’s in software engineering and a PhD in Computer Science specialising in Cryptography and Complexity.
His post-doctoral research focused on public-key cryptography and database anonymisation, including two years as Post-Doctoral Research Assistant at the Information Security Group at Royal Holloway.
In 2018, he joined Artos as Blockchain Engineer, and then progressed to Research Lead developing Zero-Knowledge solutions. Since 2020, he has been the Engineering Manager of the Aventus team that develops the AvN and the ecosystem around it.
You can find more about Alex on his website.