Fuzzing: On Benchmarking Outcome as a Function of Benchmark Properties
Dylan Wolff, Marcel Böhme, Abhik Roychoudhury
TL;DR
The paper investigates how benchmark properties shape fuzzing evaluation outcomes and presents two actionable methodologies, control and randomization, to quantify these effects. It uses controlled experiments to show that properties such as program execution time and corpus origin can causally alter fuzzer rankings, and it introduces a holistic framework based on randomization and non-parametric regression to assess multiple covariates simultaneously. Through instantiations on FuzzBench data, the study reveals substantial, often non-obvious covariate effects and argues for reporting counterfactual analyses to improve the soundness and utility of evaluations. The approach helps practitioners interpret fuzzer performance, guides benchmark design to reduce bias, and generalizes to evaluation contexts beyond fuzzing.
Abstract
Characteristics of a benchmarking setup can clearly have some impact on the benchmark outcome. In this paper, we explore two methodologies to quantify the impact of specific properties on the benchmarking outcome. Our first methodology is the controlled experiment, which identifies a causal relationship between a single property in isolation and the benchmarking outcome. However, manipulating exactly one property may not always be practical or possible. Hence, our second methodology is randomization and non-parametric regression, which identifies the strength of the relationship between arbitrary benchmark properties (i.e., covariates) and the outcome. Together, these two fundamental aspects of experimental design, control and randomization, can provide a comprehensive picture of the impact of various properties of the current benchmark on the fuzzer ranking. These analyses can be used to guide fuzzer developers towards areas of improvement in their tools and allow researchers to make more nuanced claims about fuzzer effectiveness. We instantiate each approach on a subset of properties suspected of impacting the relative effectiveness of fuzzers and quantify the effects of these properties on the evaluation outcome. In doing so, we identify multiple novel properties that can have a statistically significant effect on the relative effectiveness of fuzzers.
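To make the randomization idea concrete, the following is a minimal sketch and not the authors' implementation. It draws random sub-benchmarks from a pool of programs, records one covariate (mean execution time) and one outcome (mean coverage gap between two fuzzers) per draw, and measures their association with Spearman's rank correlation. Spearman's coefficient is used here only as a simple non-parametric stand-in, since the abstract does not specify the regression procedure. All data is synthetic, and the names `exec_ms`, `cov_a`, `cov_b`, and `sample_benchmarks` are hypothetical.

```python
import random


def ranks(values):
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(xs, ys):
    """Pearson correlation; assumes both inputs have non-zero variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def spearman(xs, ys):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(ranks(xs), ranks(ys))


def sample_benchmarks(programs, size, trials, rng):
    """Draw random sub-benchmarks; record covariate and outcome for each."""
    covariate, outcome = [], []
    for _ in range(trials):
        subset = rng.sample(programs, size)
        # Covariate: mean execution time of the sampled programs.
        covariate.append(sum(p["exec_ms"] for p in subset) / size)
        # Outcome: mean coverage advantage of fuzzer A over fuzzer B.
        outcome.append(sum(p["cov_a"] - p["cov_b"] for p in subset) / size)
    return covariate, outcome


# Synthetic pool of 40 programs (illustrative only, not measured data).
# By construction, hypothetical fuzzer B gains coverage on slower programs.
rng = random.Random(0)
programs = []
for _ in range(40):
    exec_ms = rng.uniform(1, 100)
    programs.append({
        "exec_ms": exec_ms,
        "cov_a": 1000 + rng.gauss(0, 20),
        "cov_b": 1000 + 3 * exec_ms + rng.gauss(0, 20),
    })

covariate, outcome = sample_benchmarks(programs, size=10, trials=500, rng=rng)
rho = spearman(covariate, outcome)
print(f"Spearman rho between mean exec time and A-minus-B gap: {rho:.2f}")
```

Because the synthetic data builds in an advantage for fuzzer B on slow programs, the correlation comes out negative: sub-benchmarks with slower programs favor B. On real evaluation data, a strong association of this kind would indicate that the reported ranking depends on the execution-time profile of the chosen benchmark. A correlation from randomized sampling does not by itself establish causation, which is the role of the controlled experiment.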
