Fuzzing a calculator!

"Beware of bugs in the above code; I have only proved it correct, not tried it." — Donald Knuth

It’s obviously very important to ensure software behaves as expected. That’s what we’re going to focus on today: how do you actually test a program? How can we make our software reliable?

The first and most obvious solution is to manually try all the features of the software. Two issues quickly arise, though. First, manual testing is strongly biased: everyone falls into habits when using a program, and will inevitably forget to try some edge cases. Second, the number of cases to try is most often virtually infinite. In the case of Epsilon, our calculator’s operating system, the user can press 46 different keys. Trying every possible 10-key scenario would require 42,420,747,482,776,576 manual tests, which would take over a billion years at one key press per second. And 10 keys would be a rather short scenario…
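The arithmetic behind that number is straightforward: each of the 10 presses can be any of the 46 keys, so the number of distinct scenarios is

46^10 = 42,420,747,482,776,576 ≈ 4.2 × 10^16

At one key press per second, a 10-key scenario takes 10 seconds, so exhausting them all would take about 4.2 × 10^17 seconds, roughly 13 billion years.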

Automated tests

But how about generating random scenarios automatically, and getting the computer to run them? In the case of Epsilon, we can simulate random key-press sequences (for example: OK, left arrow, sine, 9, …) and have a computer try those automatically. It is much faster: an average computer can simulate about 150,000 key presses per second.

cat /dev/urandom | pv | ./app.bin
1.54MiB 0:00:10 [ 152KiB/s]
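
Here, pv merely measures the throughput of random bytes flowing into the simulator; at 152KiB/s, that is roughly 150,000 one-byte key presses per second. On the receiving end, a minimal sketch of how the simulator could turn this stream into key events might look as follows; kNumberOfKeys and handleKeyPress are hypothetical stand-ins for Epsilon’s real event loop:

#include <cstdio>

// Number of distinct keys on the calculator.
static const int kNumberOfKeys = 46;

// Hypothetical stand-in for the real event handler.
static void handleKeyPress(int key) {
  std::printf("key %d\n", key);
}

int main() {
  int byte;
  // Map every byte read from stdin onto one of the 46 keys and dispatch it.
  while ((byte = std::getchar()) != EOF) {
    handleKeyPress(byte % kNumberOfKeys);
  }
  return 0;
}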

And the computer will steadily keep testing. This method is called monkey-testing: imagine an army of monkeys toying around with thousands of calculators, trying to get one of them to crash!

Automatic, yet intelligent!

Because the number of possible tests is extremely large, one could execute billions of test scenarios without ever seeing an interesting problem appear. That’s why it’s very important to guide the scenario generator towards interesting test cases. That’s precisely what a fuzzer does: you provide it with actual existing scenarios, and the fuzzer randomly generates similar yet interesting sequences. In practice, we used american fuzzy lop, which uses a genetic algorithm to derive new scenarios from the sample set you feed it. To know which scenarios are interesting and to select the best ones, the fuzzer needs to add some instrumentation to the target binary.
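
For illustration, a typical afl session looks like this; the file names are hypothetical, but afl-gcc (afl’s instrumenting compiler wrapper) and afl-fuzz are the actual tools. The target is first recompiled with instrumentation, then fuzzed from a directory of seed scenarios, with findings written to an output directory:

afl-gcc -o app_instrumented.bin app.c
afl-fuzz -i seed_scenarios/ -o findings/ ./app_instrumented.bin

With no further arguments, afl-fuzz feeds each generated scenario to the target’s standard input, exactly like our monkey-testing pipe above.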

In practice, we supply the fuzzer with a dozen scenarios, and a few hours (or days…) later, it will most likely have produced scenarios that crash or hang the program, if either failure can be triggered at all. What’s more, the fuzzer is kind enough to shorten those scenarios, which makes debugging even easier.
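
afl trims its test cases as it fuzzes, and it also ships a dedicated minimizer, afl-tmin, which shrinks a crashing input as much as possible while preserving the crash (the paths below are hypothetical):

afl-tmin -i findings/crashes/id:000000 -o minimized_scenario ./app_instrumented.bin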

Fuzzing isn’t a magical solution

Fuzzing is obviously awesome (thank you Michał Zalewski for afl!), yet the failure criterion is somewhat limited: the fuzzer only notices when the program crashes or hangs. Unfortunately, those are not the only failure modes: for example, Epsilon could leak memory or compute “1+2 = 11” without crashing, and the fuzzer wouldn’t know. Each testing method has its pros and cons: even though fuzzing is very fast and fully automated, a fuzzer has no knowledge of the inner workings of a program, which is why it needs to be complemented with other methods. In our case, we’re also using:

  • Unit tests: hundreds of math expressions and their expected results (see the sketch after this list). This source of truth lets us avoid major regressions when we improve the math engine.
  • Memory analysis: Memcheck, from Valgrind
  • Static analysis: Clang’s static analyzer and Coverity Scan
  • And last but not least, manual testing!
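
To give an idea, an equation-based unit test boils down to comparing the engine’s output against a table of known results. In the sketch below, evaluate() is a hypothetical stand-in for the real math engine, hard-wired to a single answer just to keep the example self-contained:

#include <cassert>
#include <cstring>
#include <string>

// Hypothetical stand-in for the math engine: parses and evaluates an expression.
static std::string evaluate(const std::string & expression) {
  return expression == "1+2" ? "3" : "?";  // The real engine computes this
}

int main() {
  // Each entry pairs an input expression with its expected result. A fuzzer
  // would never notice the engine computing “1+2 = 11”; this assertion would.
  struct { const char * input; const char * expected; } cases[] = {
    {"1+2", "3"},
  };
  for (auto & c : cases) {
    assert(evaluate(c.input) == c.expected);
  }
  return 0;
}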