Michael Rash, Security Researcher

Code Coverage: Challenges For Open Source Projects

In this blog post I'll pose a few challenges to open source projects. These challenges, among other goals, are designed to push automated test suites to maximize code coverage. For motivation, I'll present code coverage stats from the OpenSSL and OpenSSH test suites, and contrast this with what I'm trying to achieve in the fwknop project.

Stated bluntly, if comprehensive code coverage is not achieved by your test suite, you're doing it wrong.

With major bugs like Heartbleed and "goto fail", there is clearly a renewed need for better automated testing. As Mike Bland demonstrated, the problem isn't so much about technology since unit tests can be built for both Heartbleed and "goto fail". Rather, the problem is that the whole software engineering life cycle needs to embrace better testing.

1) Publish your code coverage stats

Just as open source projects publish source code, they should publish their test suite code coverage results. The mere act of publishing coverage results - just like publishing the code itself - engages developers and users alike in improving them. The report should clearly display every branch, line, and function exercised by the test suite, and it should be regenerated and published for every software release. Without such stats, how can users have confidence that a project's test suite is comprehensive? Bugs still crop up even with good code coverage, but many bugs will be squashed during the development of a test suite that strives to exercise every line of source code.

For a project written in C and compiled with gcc, the lcov tool provides a nice front-end for gcov coverage data. lcov was used to generate the following code coverage reports on OpenSSL-1.0.1h, OpenSSH-6.6p1, and fwknop-2.6.3 using the respective test suite in each project:

                             Hit      Total    Coverage
OpenSSL-1.0.1h   Lines:      44604    96783    46.1 %
                 Functions:   3518     6574    53.5 %
                 Branches:   22653    67181    33.7 %

OpenSSH-6.6p1    Lines:      18241    33466    54.5 %
                 Functions:   1217     1752    69.5 %
                 Branches:    9362    22674    41.3 %

fwknop-2.6.3     Lines:       6971     7748    90.0 %
                 Functions:    376      377    99.7 %
                 Branches:    4176     5239    79.7 %


The numbers speak for themselves. To be fair, fwknop has a much smaller code base than OpenSSL or OpenSSH - less than 1/10th the size of OpenSSL and about 1/5th the size of OpenSSH in terms of lines reported by gcov - so it is presumably much easier to reach high levels of code coverage in fwknop than in the other two projects. Nevertheless, it is shocking that line coverage from the OpenSSL test suite is below 50%, and not much better for OpenSSH. What are the odds that the other half of the code is bug free? What are the odds that changes in newer versions won't break assumptions made in the untested code? What are the odds that one of the next security vulnerabilities announced in either project stems from this code? Test suites are not a panacea, but there are almost certainly bugs lying in wait within the untested code, and it is easier to have confidence in code that has at least some test suite coverage than in code that has none - at least as far as the test suite is concerned.

To be clear, both OpenSSL and OpenSSH use tools beyond their respective test suites to try to maintain code quality - OpenSSL uses Coverity, for example - but these tools are not integrated with test suite results and do not contribute to the coverage stats above. What I'm suggesting is that the test suites themselves should get much closer to 100% coverage, and this may require the integration of infrastructure like a custom fuzzer or a fault injection library (more on this below).

On another note, given that an explicit design goal of the fwknop test suite is to maximize code coverage, and given the results above, it is clear there is significant work left to do.

2) Make it easy to automatically produce code coverage results

Neither OpenSSL nor OpenSSH makes it easy to automatically generate code coverage stats like those shown above. One can always Google for the appropriate CFLAGS and LDFLAGS settings, recompile, and run lcov, but you shouldn't have to. This should be automatic and built into the test suite as an option. If your project uses autoconf, there should be a top level --enable-code-coverage switch (or similar) on the configure script, and the test suite should take the remaining steps to produce the code coverage reports. Without this, there is unnecessary complexity and manual work, and this affects users and developers alike. My guess is that this lack of automation is one reason why code coverage in OpenSSL and OpenSSH is not better. Of course, it takes a lot of effort to develop a test suite with comprehensive code coverage support, but automation is low-hanging fruit.

If you want to generate the code coverage reports above, here are two trivial scripts - one for OpenSSH and another for OpenSSL. This one works for OpenSSH:
#!/bin/sh
#
# Basic script to compile OpenSSH with code coverage support via gcc
# gcov and build HTML reports with lcov.
#

LCOV_DIR=lcov-results
LCOV_FILE=coverage.info
LCOV_FILTERED=coverage_final.info
PREFIX=~/install/openssh  ### tmp path

mkdir -p $LCOV_DIR

make clean
./configure --with-cflags="-fprofile-arcs -ftest-coverage" --with-ldflags="-fprofile-arcs -lgcov" --prefix=$PREFIX
make
make tests

### build coverage info and filter /usr/include/ files
lcov --capture --directory . --output-file $LCOV_FILE
lcov -r $LCOV_FILE /usr/include/\* --output-file $LCOV_DIR/$LCOV_FILTERED

### create the HTML report
genhtml $LCOV_DIR/$LCOV_FILTERED --output-directory $LCOV_DIR

exit
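
And here is a corresponding sketch for OpenSSL. It mirrors the OpenSSH script above and assumes that ./config passes the extra gcc coverage flags through to the build - the flags may need adjustment for your platform and toolchain:
#!/bin/sh
#
# Basic script to compile OpenSSL with code coverage support via gcc
# gcov and build HTML reports with lcov. The coverage flags are passed
# directly to ./config and may need adjustment for your platform.
#

LCOV_DIR=lcov-results
LCOV_FILE=coverage.info
LCOV_FILTERED=coverage_final.info

mkdir -p $LCOV_DIR

make clean
./config -fprofile-arcs -ftest-coverage
make
make test

### build coverage info and filter /usr/include/ files
lcov --capture --directory . --output-file $LCOV_FILE
lcov -r $LCOV_FILE /usr/include/\* --output-file $LCOV_DIR/$LCOV_FILTERED

### create the HTML report
genhtml $LCOV_DIR/$LCOV_FILTERED --output-directory $LCOV_DIR

exit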

3) Integrate a fuzzer into your test suite

If your test suite does not achieve 99%/95% function/line coverage, architectural changes should be made to reach these goals and beyond. This will likely require that the test suite drive a fuzzer against your project and measure how well it exercises the code base.

Looking at code coverage results for older versions of the fwknop project was an eye opener. Although the test suite had hundreds of tests, there were large sections of code that were not exercised. It was for this reason that the 2.6.3 release concentrated on more comprehensive automated test coverage. However, achieving better coverage was not a simple matter of executing fwknop components with different configuration files or command line arguments - it required the development of a dedicated SPA packet fuzzer, along with a special -DENABLE_FUZZING macro built into the source code so the fuzzer can reach portions of code that would otherwise be difficult to trigger due to encryption and authentication requirements. This is similar to the strategy proposed for Michal Zalewski's American Fuzzy Lop fuzzer (see the "Known Limitations" section of its documentation).

The main point is that fwknop itself was changed to support fuzzing driven by the test suite as a way to extend code coverage. It is this tight integration of fuzzing into the test suite that makes it such a powerful testing technique, and code coverage results let you measure how well it is working.
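
To make the idea concrete, here is a rough sketch of what a test-suite fuzzing stage might look like. The spa_fuzzer and server_fuzzing_build commands and their arguments are hypothetical - they stand in for the fwknop packet fuzzer and a server binary compiled with -DENABLE_FUZZING - but the structure (fuzz, watch for crashes, then capture coverage) is the point:
#!/bin/sh
#
# Hypothetical sketch of a fuzzing stage driven by a test suite. The
# spa_fuzzer and server_fuzzing_build names are illustrative only.
#
i=0
while [ $i -lt 10000 ]; do
    ./spa_fuzzer --seed $i > fuzz_pkt.$i             ### generate a fuzzed SPA packet
    ./server_fuzzing_build --packet-file fuzz_pkt.$i ### parse it in the fuzzing-enabled build
    if [ $? -ge 128 ]; then
        echo "[CRASH] fuzzing seed $i" >> fuzzing.log ### terminated by a signal - record it
    fi
    i=$((i + 1))
done

### capture gcov data produced by the fuzzing run so it shows up in coverage stats
lcov --capture --directory . --output-file fuzzing_coverage.info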

Incidentally, an example double free() bug that the fwknop packet fuzzer triggered in conjunction with the test suite can be found here (fixed in 2.6.3).

4) Further extend your code coverage with fault injection

Any C project leveraging libc functions should implement error checking against function return values. The canonical example is checking to see whether malloc() returned NULL, and if so this is usually treated as an unrecoverable error like so:
char *buf = NULL;
size_t size = 100;

/* Treat allocation failure as an unrecoverable error */
if ((buf = malloc(size)) == NULL) {
    clean_up();
    exit(EXIT_FAILURE);
}
Some projects elect to write a "safe_malloc()" wrapper for malloc() or other libc functions so that error handling can be done in one place, but it is not feasible to do this for every libc function. So, how can you verify at run time that such error conditions are properly handled? For malloc(), NULL is typically returned only under extremely high memory pressure, so it is hard to trigger this condition and still have a functioning system, let alone a functioning test suite. In other words, in the example above, how can the test suite achieve code coverage for the clean_up() function? Other examples include filesystem or network functions that return errors when a disk fills up, a network communication is blocked, and so on.

What's needed is a mechanism for triggering libc faults artificially, without requiring the underlying conditions to actually exist that would normally cause such faults. This is where a fault injection library like libfiu comes in. Not only does it support fault injection at run time against libc functions without the need to link against libfiu (a dedicated binary "fiu-run" takes care of this), but it can also be used to trigger faults in arbitrary non-libc functions within a project to see how function callers handle errors. In fwknop, both strategies are used by the test suite, and this turned up a number of bugs like this one.
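
As a quick illustration of the fiu-run side (the target binary here is just a placeholder), libfiu can randomly fail a percentage of libc I/O calls in an unmodified program:
### randomly fail 5% of POSIX I/O calls in an unmodified binary;
### ./some_binary is a placeholder for the program under test
fiu-run -x -c "enable_random name=posix/io/*,probability=0.05" ./some_binary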

Full disclosure: libfiu does not yet support code coverage when executing a binary under fiu-run because there are problems interacting with libc functions necessary to write out the various source_file.c.gcno and source_file.c.gcda coverage files. This issue is being worked on for an upcoming release of libfiu. So, in the context of fwknop, libfiu is used to trigger faults directly in fwknop functions to see how calling functions handle errors, and this strategy is compatible with gcov coverage results. The fiu-run tool is also used, but more from the perspective of trying to crash one of the fwknop binaries since we can't (yet) see code coverage results under fiu-run. Here is an example fault introduced into the fko_get_username() function:
/* Return the current username for this fko context.
*/
int 
fko_get_username(fko_ctx_t ctx, char **username)
{   
#if HAVE_LIBFIU
    fiu_return_on("fko_get_username_init", FKO_ERROR_CTX_NOT_INITIALIZED);
#endif
With the fault set (a special --fault-injection-tag argument on the fwknopd server command line enables it), the error handling code seen at the end of the example below is executed via the test suite. For proof that the error handling code runs, see the full coverage report (look at line 240).
/* Populate a spa_data struct from an initialized (and populated) FKO context.
*/
static int
get_spa_data_fields(fko_ctx_t ctx, spa_data_t *spdat)
{   
    int res = FKO_SUCCESS;

    res = fko_get_username(ctx, &(spdat->username));
    if(res != FKO_SUCCESS)
            return(res);

Once again, it is the integration of fault injection with the test suite and the corresponding code coverage reports that extends testing efforts in a powerful way. libfiu offers many nice features, including thread safety, the ability to enable a fault injection tag relative to other functions in the call stack, and more.
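
For example, the test suite can enable the fault shown above when it launches the server. The invocation below is abbreviated (all other fwknopd arguments are omitted); only the --fault-injection-tag argument and the tag name come from the example above:
### enable the fault compiled into fko_get_username() so that
### get_spa_data_fields() is forced down its error handling path
./fwknopd --fault-injection-tag fko_get_username_init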

5) Negative valgrind findings should force tests to fail

So far, a theme in this blog post has been better code coverage through integration. I've attempted to make the case for the integration of fuzzing and fault injection with project test suites, and code coverage stats should be produced for both styles of testing.

A third integration effort is to leverage valgrind. That is, the test suite should run tests underneath valgrind where possible (speed and memory usage may be constraints here depending on the project). If valgrind discovers a memory leak, a double free(), or any other problem, this finding should automatically cause the corresponding tests to fail. In some cases valgrind suppressions will need to be created - for example, if a project depends on libraries or other code that is known to have issues under valgrind - but findings within the project's own sources should cause tests to fail. For projects heavy on the crypto side, there are some instances where code is deliberately built in a manner that triggers a valgrind error (see Tonnerre Lombard's write-up on the old Debian OpenSSL vulnerability), but these are not common occurrences and suppressions can always be applied. The average valgrind finding in a large code base should cause test failure.
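
As a minimal sketch of that integration (this is not the fwknop test suite's actual code, and ./some_test is a placeholder), a test wrapper can parse the valgrind log and turn any finding into a failure:
#!/bin/sh
#
# Minimal sketch: run a single test binary under valgrind and fail if the
# test exits non-zero, valgrind reports any errors, or a definite memory
# leak is found. ./some_test is a placeholder for the real test binary.
#
valgrind --leak-check=full --log-file=valgrind.out ./some_test
test_rv=$?

if ! grep -q "ERROR SUMMARY: 0 errors" valgrind.out; then
    echo "FAIL: valgrind reported errors (see valgrind.out)"
    exit 1
fi
if grep -q "definitely lost: [1-9]" valgrind.out; then
    echo "FAIL: valgrind reported a definite memory leak (see valgrind.out)"
    exit 1
fi
if [ $test_rv -ne 0 ]; then
    echo "FAIL: test exited non-zero"
    exit 1
fi
echo "PASS"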

Although running tests under valgrind will not expand code coverage, valgrind is a powerful tool that should be tightly integrated with your test suite.

Summary

  • Publish your code coverage stats
  • Make it easy to automatically produce code coverage results
  • Integrate a fuzzer into your test suite
  • Further extend your code coverage with fault injection
  • Negative valgrind findings should automatically force tests to fail