Fuzzing a calculator!

“Beware of bugs in the above code; I have only proved it correct, not tried it.” — Donald Knuth

It’s obviously important to ensure software behaves as expected. That’s what we’re going to focus on today. How do we actually test a program? How can we make our software reliable?

The first and most obvious solution is to manually try all the features of the software, though two issues will quickly arise. First, manual testing is very often strongly biased: one always has habits when using a program, and will inevitably forget to try some edge cases. Second, the amount of cases to try is most often virtually infinite. In the case of Epsilon, our calculator’s operating system, the user can press 46 different keys. Trying every single 10 keys scenario would require 42420747482776576 manual tests, which would take over a billion years if pressing a key every second. And 10 keys would be a rather short scenario…

Automatic tests

What if we generate random scenarios automatically, and get the computer to run them? In the case of Epsilon, we could simulate random key press sequences (for example, OK, left arrow, sine, 9, …) and have a computer try those automatically. It would indeed be much faster: an average computer can simulate about 150 000 key presses per second.

cat /dev/urandom|pv|./app.bin
1.54MiB 0:00:10 [ 152KiB/s]

The computer will steadily keep testing. This method is called monkey-testing. Imagine an army of monkeys toying around with thousands of calculators, trying to get one of them to crash!

Automatic, yet intelligent!

Because the number of possible tests is extremely large, you could execute billions of test scenarios without seeing an interesting problem appear. This is why it’s important to guide the scenario generator towards interesting test cases. That’s precisely what a fuzzer does. You provide actual existing scenarios, and the fuzzer will randomly generate similar yet interesting sequences. In practice, we used american fuzzy lop which uses a genetic algorithm to infer new scenarios from the sample set you feed it. To know which scenario is interesting and to select the best ones, the fuzzer needs to add some instrumentation to the target binary.

We supply the fuzzer with a dozen scenarios, and a few hours (or days…) later, it will most likely produce scenarios that crash or hang the program if either is likely to occur. The fuzzer is kind enough to shorten those scenarios, which makes debugging even easier.

Fuzzing isn’t a magical solution

Fuzzing is awesome (thank you Michal Zalewski for afl!), yet the failure criterion is somewhat limited: either the program crashes or hangs. Unfortunately, these are not the only failure cases. For example, Epsilon could leak memory or compute “1+2 = 11” without crashing, and the fuzzer wouldn’t know. Each testing methods has its pros and cons. Even though it’s fast and fully automated, a fuzzer has no knowledge of the inner workings of a program. This is why other methods need to be used in complement. In our case, we are also using:

Unit tests: hundreds of math equations and their expected result. This source of truth allows us to avoid major regressions when we improve the math engine.
Memory analysis : memcheck from Valgrind
Static analysis : clang-checker and coverity scan
And last but not least, manual testing as well!

Émilie Feral — January 04, 2018

Fuzzing a calculator!

Automatic tests

Automatic, yet intelligent!

Fuzzing isn’t a magical solution

Émilie Feral — Software Engineer

A font for gorgeous worksheets

What are those lines on products made out of plastic?