For our book club this time, Daniel, Mark and I were joined by Vince to discuss an article by Andrew Halberstadt and Marco Castelluccio about using machine learning to test Firefox more efficiently. A short summary follows. I should note that part of the article dives into maths that is beyond my understanding, so we concentrated on what Mozilla is trying to achieve, even if we skimped on the details of how.

Mozilla has roughly 85 thousand tests for Firefox, which is built for about 90 configurations (combinations of target operating system, build flags, and so on), and they get around 300 pushes to version control per day. That’s about 2.3 billion individual tests to run every day.
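The arithmetic behind that figure is easy to check. The snippet below multiplies the three rounded numbers quoted above; the variable names are mine, not Mozilla's.

```python
# Rounded figures as quoted in the summary above.
tests = 85_000            # individual tests for Firefox
configurations = 90       # OS / build-flag combinations
pushes_per_day = 300      # pushes to version control per day

# Running every test on every configuration for every push:
test_runs_per_day = tests * configurations * pushes_per_day

print(f"{test_runs_per_day:,}")  # 2,295,000,000, i.e. about 2.3 billion
```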

Running that many tests takes a lot of hardware resources, and so it has a real cost. Beyond the financial aspect, running all tests for every configuration for every push takes time and causes friction in the development process: other developers have to wait for your push to finish testing.

The goal of the work reported in the article is to reduce the number of tests that are run, without sacrificing quality, using machine learning. Previously, Mozilla prioritised some configurations over others, running a smaller set of tests for some configurations than for others, or running tests less frequently. They also integrate changes into a dedicated integration branch, and merge from that into the main branch manually, using dedicated people, “code sheriffs”, who make informed decisions.

The machine-learning approach is based on the realisation that one can deduce patterns from historical data: if this module changes and any tests fail, it’s probably this set of tests. Thus, when a new change comes in, analysing what has changed can inform which tests to run, or in which order to run them. For CI, if the automated tests are going to fail, it’s better that they fail fast. Not only does this mean developers get feedback sooner, and can start fixing things earlier, but also the other tests don’t necessarily need to be run at all.
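To make the idea concrete, here is a minimal sketch of ordering tests by historical failures. It is a plain counting heuristic standing in for a real machine-learning model, and it is not Mozilla's implementation; the function names, module names, test names and history data are all invented for illustration.

```python
from collections import Counter, defaultdict


def build_failure_index(history):
    """Count, per module, how often each test failed when that module changed.

    `history` is an iterable of (changed_modules, failed_tests) pairs,
    one per past push. This data format is an assumption of this sketch.
    """
    index = defaultdict(Counter)
    for changed_modules, failed_tests in history:
        for module in changed_modules:
            index[module].update(failed_tests)
    return index


def rank_tests(index, changed_modules, limit=None):
    """Return tests most likely to fail first, for the given changed modules."""
    scores = Counter()
    for module in changed_modules:
        scores.update(index.get(module, {}))
    # Highest failure count first; ties broken by name for a stable order.
    ranked = [name for name, _ in sorted(scores.items(),
                                         key=lambda item: (-item[1], item[0]))]
    return ranked if limit is None else ranked[:limit]


# Hypothetical history of four past pushes.
history = [
    ({"network"}, {"test_http", "test_dns"}),
    ({"network", "ui"}, {"test_http"}),
    ({"ui"}, {"test_toolbar"}),
    ({"layout"}, set()),
]
index = build_failure_index(history)
print(rank_tests(index, {"network"}))  # ['test_http', 'test_dns']
```

Running the highest-ranked tests first is what gives the "fail fast" behaviour: if a push is going to break something, the likeliest culprits are tried before anything else.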

The book club group had a long discussion about the article. Overall we found Mozilla’s approach to testing the Firefox browser impressive. We certainly understand the motivation to reduce the number of tests run without compromising on quality. It seems to be an issue that many large software projects have to face: it’s one more thing to balance to meet various conflicting needs and requirements.

Large code bases are different from small ones. The sheer scale of things brings problems rarely seen in smaller code bases, and exposes more problems in compilers, operating systems, and even hardware, than most small projects do. Large projects also tend to have more flaky tests (tests that sometimes fail for no obvious reason, but sometimes pass). Large projects may also have to make different kinds of careful compromises when tests fail, to maintain forward momentum.

We each had stories about the amusing ways machine learning can fail, but it seems Mozilla is doing things in ways that should avoid most of the inexplicable failures. We had some thoughts about how Mozilla might use their historical data and machine learning more. For example, perhaps ML could identify flaky tests automatically: tests that fail unusually often, even when nothing seemingly related to them has changed? Or perhaps it could identify tests that are becoming flaky?
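As a rough illustration of the flaky-test idea, the sketch below flags tests that fail often on pushes that did not touch anything related to them. It uses a simple threshold rather than machine learning. The record format, the threshold values, and above all the notion of a change being "related" to a test are assumptions of ours, not something the article describes.

```python
from collections import defaultdict


def find_flaky_candidates(runs, threshold=0.2, min_runs=5):
    """Return names of tests that fail often when nothing related changed.

    `runs` is an iterable of (test_name, change_was_related, passed) tuples.
    How "related" is decided is left open; it is an assumption of this sketch.
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for test_name, change_was_related, passed in runs:
        if change_was_related:
            continue  # a failure here may well be a genuine regression
        totals[test_name] += 1
        if not passed:
            failures[test_name] += 1
    return sorted(
        name for name, total in totals.items()
        if total >= min_runs and failures[name] / total > threshold
    )


# Hypothetical run records.
runs = (
    [("test_a", False, True)] * 8 + [("test_a", False, False)] * 2
    + [("test_b", False, True)] * 5 + [("test_b", False, False)] * 5
    + [("test_c", True, False)] * 10
)
print(find_flaky_candidates(runs))  # ['test_b']
```

Here `test_c` fails every time, but only on related changes, so it is not flagged. Spotting tests that are becoming flaky would need the same rate computed over a sliding window of recent pushes.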

Maybe ML could identify parts of code that could do with quality improvement: specific code modules that result in test failures unusually often? Or identify hidden, undeclared coupling between parts of the code: if this code module is changed, then tests for this other code module fail unusually often?

Overall, we liked the article. It reports on work in progress, and we look forward to learning more.