Had a discussion recently with my friend Richard Braakman about debugging. Here’s a lightly edited version, in case it is of any use for others.

Richard is awesome at debugging, and I’ve often suggested he should employ himself as a consulting debugger, the way Sherlock Holmes was a consulting detective: to be called in when the usual experts are stymied.

The context of this discussion is that I’m working in my free time on a new system, named Ick, which will have a server component, and I want to make it easily debuggable. So of course I asked advice from the best debugger I know.

L: What would your top three hints be for someone who needs to learn to debug better?

R: Hmmm… obviously it depends on where they are now :)

L: I’m asking for myself primarily but also collection ideas in case I do a talk on this some day.

R: One. Find a way to observe the problems. By which I mean, instead of massaging the starting conditions to make hypotheses, actually follow it step by step through what goes wrong. See what it’s actually doing. This might mean using a debugger, or strace, or tcpdump, or adding logs, whatever it takes. (When doing so involves extra work it’s often tempting to leave it as a last resort. I’ve found that it pays to resort to this early.)

Two. When you conclude that the behaviour you’re seeing is completely impossible, don’t give up. You’re actually getting close. One of your assumptions is wrong. When you find which one, you’ll have the bug.

Three. Try to duplicate the buggy behaviour on another system. (This is easiest if you already have one - the “it works for me” case). Then systematically remove the differences between the working and non-working systems.

If you ask on another day you might get a different top 3. :)

Something I often say is “make sure the code you’re reading is the same code you’re running.”

L: Interesting. Thank you. This helps me. New question: if I were building a program that you may need to debug later, what would you like me to provide to help you debug?

R: Heh. An easy way to get the debug log. I often find that logging has been turned off in multiple ways. I have to figure out how to compile it in, then how to turn it on, then how to make it go somewhere else than /dev/null…. and some programs have different debug logging frameworks in different parts that have to be turned on in different ways. It’s a pain.

Sometimes there is actually a simple way but it’s not documented, and I have to read the code or google for it.

Sometimes the way to get logs keeps changing between versions. (Looking at Qt.)

L: Haha. Q: would it be useful to have highly detailed debug log for recent times but the rest reduced the level of detail for older stuff? Or would that just be complicated?

R: Do you mean for recently written code?

L: Recent in terms of of execution, so much detail for the latest hour, but less for older logs.

R: Hmm. I had one program where I logged debugs logs to a memory buffer… as soon as there was an error or warning log, it dumped the whole debug log too (and wrote the debug log directly for the rest of the transaction). If the transaction was error free, the debug log buffer was discarded.

It works for server-like programs.

I think time based log purging wouldn’t be good in my current type of jobs, where I often see problems only the next day etc.

L: Was that useful? Should I consider that for Ick?

R: Yeah it was useful :) It meant we got detailed logs of problem cases, without overflowing the log files. We had a high transaction rate and the logs were several GB a day.

L: I shall keep it in mind then :)

R: I guess it’s less useful if you have no capacity problem :)

L: Speaking of logs, what’s a good log like?

R: One thing is, it helps if the log text is unique in the code. That way I know for sure which bit of code generated the log.

Hmm, when I modify the logging, it’s often to add the actual values the code is having problems with. Like instead of “too many entries”, log how many there were and what the limit was.

Same with stuff like “unknown command code”.

Including those values is often extra work and I don’t actually know if it pays to do it in advance :) I just know it’s the change I most commonly make.

L: Would you say log message should make sense without having the code in front of you?

R: I think that’s a “sometimes”.

Like… if the log message is like trace logging, for example if it’s logging the disposition of a packet, then the trace should be understandable without looking at the code. But if it’s logging something internal, like the workings of an algorithm, then I’m probably looking at that code already.

L: Nod, good point.

What’s a good way to control logging? As in where it goes and how detailed it is.

R: That I don’t know :) All the ways I’ve seen annoy me.

I think my favourite is to generate a separate debug.log file next to the normal logs.

It all depends on whether capacity is a problem.

Hmm a separate point is about consistency in logging. Like, if there are five reasons to reject an input and only four of them have logs, then I’m likely to miss that fifth one and assume “no log means input was accepted”.

L: Nod.

What’s your preferred format for logs?

R: I don’t think I have a preference :)

L: Do you like logs that are easily processed by programs?

R: Nah, at most I use grep. But the people who make intrusion detection systems prefer that :)

Same for system monitoring I guess.

L: Check