Tim Hastings - NonHostile (because there's no need)

Weblog and collection of geeky articles.



The term 'bug' refers to a glitch in a system that causes unexpected or undesirable behaviour. As bugs can occur in lots of different types of system, this article concentrates on bugs that occur in software systems; although in some areas, my approach is general enough to transfer to fixing bugs in non-software systems.

Software bugs are an inherent problem in software development that can occur at any point in the software life cycle. Fixing bugs earlier in the development process can be significantly easier and cheaper than fixing them once the software has gone live. The easiest bugs to fix are found lurking in specifications before any code has been written; a pre-emptive fix.

Know Your Enemy
In Sun Tzu's The Art of War, Sun says:
If you know the enemy and know yourself, you need not fear the result of a hundred battles.
In your life as a software developer you are going to encounter more than a hundred battles with bugs, so let's meet the enemy. Not all bugs are the same. I am going to crudely divide all bugs into two camps, logical bugs and system bugs.
Logical bugs: occur when the system is working but gives the wrong result or behaviour. This means that there is a fault somewhere in the logic of the software. Fixing these bugs usually requires domain knowledge where the developer or an expert provides details of how the system should behave.

System bugs: cause software to break; this may mean hang, lock, croak, crap-out or blue screen of death. You get the idea. This is where a program encounters problems that have not been anticipated and for some reason the software is unable to proceed, having no choice but to fail. Of course, a system bug may not be the fault of the application and may occur in the depths of the operating system or within a third-party library. This is still a bug nonetheless. Fixing system bugs may not require domain knowledge, as the goal is to prevent the software from being broken.
The line I draw between these two camps is whether the program runs but gives the wrong result, or fails to give a result at all. I would not define issues of performance and scalability as bugs as such; they are more like acceptance criteria. But if pushed, I would place them alongside system bugs.

Step 1: Recreate the Problem
Unless you can recreate the problem, you will never be sure that you have fixed the bug. I cannot stress enough how important this point is. To be a good bug fixer, you must invest as much effort as it takes to recreate the bug on demand. This will result in a sequence of steps that may be very bizarre, but this is the test case.

To accurately recreate the problem, you may need to do some or all of the following:
Hunt for clues: anyone who's seen Scooby Doo knows that this is essential to solving a mystery. A good bug report describes what the user was doing and how the software and computer are configured, and ideally would include the fix. (It's very rare for users to submit the fix, but this can happen with open source software.)

Given a very vague bug report such as "I can't print", there may be a million possible causes, some of which may not relate to your software. It is not unreasonable to require a minimum level of detail in a bug report to help you determine the problem. At the very least: what piece of software (and version)? What were you doing that caused the bug? What error messages were displayed? Spending time with the user and asking them to show you what they did is great but not always possible. Teaching the user how to take screenshots of faults that occur will pay dividends.

Visit the crime scene: Many systems do not have human users, or the user may not be around or may not be of much help. In this case, you should put on your shabby coat and visit the crime scene à la Columbo. Here you need to gather evidence. Look into log files: what was the system doing when it failed? If it was processing a transaction or some data, is this data available? Was the exception logged? What was the error number or message? All this evidence will be needed when looking for suspects and eliminating others. Statements in log files are like alibis; they allow you to eliminate sections of code from your enquiry.
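
As a rough sketch of gathering evidence from a log, the Python snippet below pulls out each error line together with the few lines that came just before it. The log format, the sample lines and the name find_errors are all invented for illustration; a real log will need its own pattern.

```python
import re
from collections import deque

# Assumed (hypothetical) log format: "<date> <time> <LEVEL> <message>"
ERROR_PATTERN = re.compile(r"\b(ERROR|FATAL)\b")


def find_errors(lines, context=2):
    """Return (preceding_lines, error_line) pairs for each error in the log."""
    recent = deque(maxlen=context)  # rolling window of the last few lines
    findings = []
    for line in lines:
        line = line.rstrip("\n")
        if ERROR_PATTERN.search(line):
            findings.append((list(recent), line))
        recent.append(line)
    return findings


# Made-up sample standing in for a real log file.
SAMPLE_LOG = [
    "2006-11-04 11:50:01 INFO Starting batch 42",
    "2006-11-04 11:50:02 INFO Loading customer 1001",
    "2006-11-04 11:50:03 ERROR NullReference while pricing order 77",
    "2006-11-04 11:50:04 INFO Batch 42 abandoned",
]

if __name__ == "__main__":
    for before, error in find_errors(SAMPLE_LOG):
        for line in before:
            print("   ", line)
        print(">>>", error)
```

The lines before the error are the alibis: they show which steps completed normally, so that code can be ruled out first.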

Wait for acts of randomness: We can no longer get away with blaming apparently random faults on cosmic rays or bit rot, and need to find a better explanation. Bugs that occur at random are difficult to recreate and because of this are the hardest to fix. Tracking down the causes of these bugs may take a long time and can often require enhancing logging to try to pinpoint the failures.
  • Can you pinpoint when, where and why the failure is occurring? Is enough data being logged?
  • Monitor the memory size of the process; steady growth implies a memory leak.
  • Does the live machine have multiple processors? This may be a threading race condition that cannot be recreated on a single-processor machine.
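
One way to check the memory point from inside a Python program is the standard library's tracemalloc module. The sketch below is only an illustration: handle_request is a made-up function with a deliberate leak, and the 1024-byte threshold is an arbitrary assumption. For other languages, an operating system process monitor does the same job from the outside.

```python
import tracemalloc

_cache = []  # hypothetical leak: grows on every call and is never cleared


def handle_request(payload):
    """Stand-in for real work; it 'forgets' to release what it stores."""
    _cache.append(payload * 100)
    return len(payload)


def memory_samples(work, rounds=5, calls_per_round=200):
    """Run `work` repeatedly and sample traced memory after each round."""
    tracemalloc.start()
    samples = []
    try:
        for _ in range(rounds):
            for i in range(calls_per_round):
                work("x" * 50 + str(i))
            current, _peak = tracemalloc.get_traced_memory()
            samples.append(current)
    finally:
        tracemalloc.stop()
    return samples


def grows_steadily(samples, min_step=1024):
    """True if every sample exceeds the previous one by at least min_step bytes."""
    return all(later - earlier >= min_step
               for earlier, later in zip(samples, samples[1:]))


if __name__ == "__main__":
    samples = memory_samples(handle_request)
    print(samples, "leak suspected" if grows_steadily(samples) else "looks stable")
```
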
Implement a new unit test: In an ideal world, you are using an automated unit testing framework (like JUnit or NUnit) and you can implement a new test or suite of tests to repeatedly recreate the bug you are investigating. Before you start fixing the bug, you now have a new failing test in your project. Now you can treat this like any other failed test in your build. An additional benefit of implementing this new test will be future defence against regression.

Mirror like live: "It works on my machine!" is a developer mantra. Whenever you hear this, you should reply: "We don't ship your machine!" If the software works in one place but not in another, then there is some difference somewhere. It is your job to determine what the difference is. This requires attention to detail. The usual suspects are: different versions of the software or related libraries, different test data or the amount of data, configuration of software or hardware, running the source code vs. built software, or single server vs. web farm. Take a closer look! What are the software's dependencies?
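
Hunting for the difference can be partly automated. The sketch below, in Python, compares two snapshots of an environment and reports only the settings that differ. The DEV and LIVE dictionaries hold invented values, and describe_this_machine gathers just a handful of facts from the standard library; a real comparison would cover whatever your software actually depends on.

```python
import os
import platform


def describe_this_machine():
    """Collect a few facts about the machine the code is running on."""
    return {
        "os": platform.system(),
        "python": platform.python_version(),
        "cpu_count": os.cpu_count(),
    }


def diff_environments(dev, live):
    """Return {name: (dev_value, live_value)} for every setting that differs."""
    differences = {}
    for name in sorted(set(dev) | set(live)):
        dev_value = dev.get(name, "<missing>")
        live_value = live.get(name, "<missing>")
        if dev_value != live_value:
            differences[name] = (dev_value, live_value)
    return differences


# Invented snapshots, for illustration only.
DEV = {"app_version": "1.4.2", "db_driver": "2.1", "cpu_count": 1}
LIVE = {"app_version": "1.4.2", "db_driver": "2.0", "cpu_count": 4, "locale": "en_GB"}

if __name__ == "__main__":
    for name, (dev_value, live_value) in diff_environments(DEV, LIVE).items():
        print(f"{name}: dev={dev_value!r} live={live_value!r}")
```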

Step 1a: Prioritise The Bug - This step is optional and depends on how busy you are.
At the scene of any mass casualty, a triage establishes the severity of each case. Whilst not necessary to fix a single bug, rating a bug's severity helps prioritise the most important bugs when overwhelmed with raging hordes.

There are many different ways to prioritise bugs, and the best way is beyond the scope of this article. The following scheme will get you started. These priorities can be applied to both logical and system bugs.
Class A: (people are dying) or from a business perspective, we are losing money. This kind of bug is interrupting business. For logical bugs, the system is doing something wrong (think heart monitor or cash machine). For system bugs, the system is just not working or fit for purpose. Class A bugs are 'must fix' and must be fixed quickly.

Class B: (the handle has come off) The key difference between a Class A and a Class B is 'does a workaround exist?' If a user can take steps to work around the problem, then the fault can be tolerated. These bugs are annoying for users. An example logical bug: there may be a navigation button that is supposed to appear, but the feature is available via the menu anyway. For system bugs, the software may not work on a Wednesday, but will work on all other days.

Patient: Doctor, doctor, it hurts when I do this. [patient raises arm]
Doctor: Well don't do it then.


Class C: (clean-up on aisle three) Somewhere in your product there is a cosmetic problem such as a spelling mistake.
Depending on your business and what your software does, your tolerance for different types of bug may be very different. The decision of which bugs to fix and which bugs not to fix is a subject in itself. Deciding whether to fix a bug is a balancing act between the number of people it affects, the risk of introducing new bugs, the cost of fixing the bug, and ultimately the cost of not fixing the bug. Eric Sink has written a very good article about the economics of this subject: My life as a Code Economist
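
The Class A, B and C scheme can be sketched as a small sorting rule. The Python below is one possible reading of it, not a definitive one: the BugReport fields and the sample reports are assumptions made for illustration, and it deliberately ignores the economic questions above.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    A = 1  # interrupting business: must fix, and quickly
    B = 2  # annoying, but a workaround exists
    C = 3  # cosmetic


@dataclass
class BugReport:
    # Hypothetical fields; a real tracker will record far more than this.
    title: str
    cosmetic_only: bool = False
    workaround_exists: bool = False


def classify(bug):
    """Apply the Class A / B / C questions in order."""
    if bug.cosmetic_only:
        return Severity.C
    if bug.workaround_exists:
        return Severity.B
    return Severity.A


def triage(bugs):
    """Most severe first; reports of equal severity keep their original order."""
    return sorted(bugs, key=classify)


if __name__ == "__main__":
    queue = [
        BugReport("Spelling mistake on login page", cosmetic_only=True),
        BugReport("Cash machine dispenses wrong amount"),
        BugReport("Navigation button missing", workaround_exists=True),
    ]
    for bug in triage(queue):
        print(classify(bug).name, bug.title)
```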


Step 2: Locate The Bug
Can you recreate the bug? Really? If not, do not try to fix the bug; go back to Step 1.

Depending on how good the clues are that you've gathered, you may have a pretty good idea of what the problem is and how to fix it. Or you may not. The good news is that this has now become a War of Attrition. You have the upper hand.
Tracking it down: If you do not know where in the system the offending bug resides, you need to start eliminating the innocent parts of the software to pin-point the bug. The best approach is using a debugger, which allows you to step through the code and inspect variables as the code executes, but sometimes this is not possible. Some systems support verbose logging which can be switched on to aid fault finding; most systems, however, do not. Not to worry. You may need to modify a version of the software to output messages at key points so you can determine how far the code is running before failing. Depending on the type of software you are writing, your messages can be written to the console, a log file, a debug window or rendered within the web page.

Once you have established a communication channel with the innards of your program, you can use this to output important variables and see which conditional branches are being executed.
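
As an illustration of such a communication channel, the sketch below uses Python's standard logging module to report the inputs, the branch taken and the result. The function apply_discount, its discount rates and the logger name "orders" are all made up for the example.

```python
import logging

log = logging.getLogger("orders")  # hypothetical logger name


def apply_discount(total, customer_type):
    """Hypothetical function instrumented with diagnostic messages."""
    log.debug("apply_discount called: total=%r customer_type=%r",
              total, customer_type)
    if customer_type == "trade":
        log.debug("taking the trade branch")
        rate = 0.20
    elif total > 100:
        log.debug("taking the large-order branch")
        rate = 0.10
    else:
        log.debug("taking the no-discount branch")
        rate = 0.0
    result = total * (1 - rate)
    log.debug("apply_discount returning %r", result)
    return result


if __name__ == "__main__":
    # Send the messages to the console; a log file works the same way.
    logging.basicConfig(level=logging.DEBUG,
                        format="%(asctime)s %(levelname)s %(message)s")
    apply_discount(150, "retail")
```

Because the messages go through a logger rather than bare print statements, they can be switched off by raising the log level once the bug is found, instead of being deleted one by one.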

Assert your assumptions: If you are having trouble pin-pointing the bug, you may need to consider what assumptions you are making and test them. This may require using external tools to diagnose or trace systems or writing code to make sure that these assumptions are valid.

For example:
  • Are variables being passed as parameters being modified?
  • Are you connected to the database server you think you are?
  • Is the browser accepting cookies?
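
The first of these questions can be tested with a few lines of code. In the Python sketch below, leaves_arguments_untouched is a hypothetical helper that calls a function and reports whether any of its arguments changed; add_vat and add_vat_fixed are invented examples of a function that breaks the assumption and one that honours it.

```python
import copy


def add_vat(prices, rate=0.2):
    """Hypothetical function that is *assumed* not to modify its argument."""
    for i, price in enumerate(prices):
        prices[i] = round(price * (1 + rate), 2)  # oops: edits the caller's list
    return prices


def add_vat_fixed(prices, rate=0.2):
    """Same calculation, but builds a new list instead."""
    return [round(price * (1 + rate), 2) for price in prices]


def leaves_arguments_untouched(func, *args):
    """Call func(*args) and report whether every argument is unchanged."""
    before = copy.deepcopy(args)
    func(*args)
    return before == args


if __name__ == "__main__":
    print("add_vat:", leaves_arguments_untouched(add_vat, [10.0, 20.0]))
    print("add_vat_fixed:", leaves_arguments_untouched(add_vat_fixed, [10.0, 20.0]))
```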

Step 3: Fix The Bug
Nine times out of ten, once a system bug has been pin-pointed, the cause and fix leap out as an obvious schoolboy error, the sort of thing that would normally be found during a code review. These are usually caused by code that makes assumptions, weak exception handling, and so on.

Sadly, there is no such thing as an all-purpose fix; the right fix depends on what is broken. This is where I refer back to the two camps of bugs I defined earlier.
Fixing a logical bug requires domain knowledge. What should the system be doing? This will involve going back to specifications (if they exist) or asking experts to try to determine what is correct. You may find that the behaviour is "by design" or "it's a feature, not a bug". This may be the case, but these are also developer cop-outs. It may be that the original design or specification did not anticipate the scenarios that caused this bug. Sometimes, fixes require going back to the drawing board.

Fixing a system bug usually requires using the language or system libraries. If the bugs are found to exist in a third-party library or the operating system, you may need to find an alternative way of doing what you want. If the problem does exist outside your software, try to recreate the problem in isolation from your software. Create a new project; if you can write the simplest case where this problem can be found, the provider of the library is more likely to help you with a fix. If you give them thousands of lines of code and say "your code's broke", they will likely not be so helpful.

Beware of side-effects: Fixing bugs in software is notorious for introducing new bugs. Developers often focus so closely on the bug in hand that they forget other scenarios that the code should support. Rewriting huge portions of software is very risky and should ring alarm bells as it may not cover the other scenarios. There is no silver bullet to this problem. You need to understand how this part of the system and its behaviour relate to the rest of the system and make sure that you consider this when making your fix.

Testing the fix: Testing that the bug is fixed should be straightforward given that it can be recreated. The harder part is making sure that no new bugs have been introduced. This is where anyone who has implemented automated testing can have a gold star and go home early. For everyone else, you have to stay behind and consider a test plan to cover the areas impacted by the fix.

Version control for source code: These tools are invaluable. They allow a developer to store the history of the source code and compare the differences between the different versions. When programming, you may break something and want to revert to an earlier working version, or undo some changes made several days ago. Source control is what you need, and there are free tools available (check out Subversion). I think that most developers who have used version control are convinced of the benefits and would never go back. When used during bug fixing, these tools allow you to see recent changes, and can help you check that you have only changed what you intended, remove your diagnostic code, or roll back failed fix attempts.

Summary
Fixing bugs goes with the territory of software development. Like any other skill, the more experience of software development and bug fixing you have, the easier it becomes and the more you can appreciate common pitfalls.

The process above describes the steps I take when faced with bugs and it works for me! Your mileage may vary.

In my next bug-related article, I will look at approaches to writing software that make bug finding, fixing and maintenance easier. Until then, happy bug fixing!


© Copyright 2006 Tim Hastings (all rights reserved)



Development, Saturday, November 4, 2006 11:54

Software Development 101 - How to fix a bug (this post, made Saturday, November 4, 2006 11:54)