Fuzzing: On Benchmarking Outcome as a Function of Benchmark Properties

Dylan Wolff, Marcel Böhme, Abhik Roychoudhury

TL;DR

The paper investigates how benchmark properties shape fuzzing evaluation outcomes and presents two actionable methodologies—control and randomization—to quantify these effects. It demonstrates controlled experiments showing that properties like program execution time and corpus origin can causally alter fuzzer rankings, and introduces a holistic randomized, non-parametric regression framework to assess multiple covariates simultaneously. Through instantiations on FuzzBench data, the study reveals substantial, often non-obvious covariate effects and argues for reporting counterfactual analyses to improve soundness and utility of evaluations. The approach enhances practitioners' ability to interpret fuzzer performance, guides benchmark design to reduce bias, and generalizes to broader evaluation contexts beyond fuzzing.

Abstract

The characteristics of a benchmarking setup can clearly affect the benchmark outcome. In this paper, we explore two methodologies to quantify the impact of specific properties on the benchmarking outcome. Our first methodology is the controlled experiment, which identifies a causal relationship between a single property in isolation and the benchmarking outcome. However, manipulating exactly one property may not always be practical or possible. Hence, our second methodology is randomization and non-parametric regression, which identifies the strength of the relationship between arbitrary benchmark properties (i.e., covariates) and the outcome. Together, these two fundamental aspects of experimental design, control and randomization, can provide a comprehensive picture of the impact of various properties of the current benchmark on the fuzzer ranking. These analyses can guide fuzzer developers towards areas of improvement in their tools and allow researchers to make more nuanced claims about fuzzer effectiveness. We instantiate each approach on a subset of properties suspected of impacting the relative effectiveness of fuzzers and quantify the effects of these properties on the evaluation outcome. In doing so, we identify multiple novel properties that can have a statistically significant effect on the relative effectiveness of fuzzers.

Paper Structure (44 sections, 2 equations, 9 figures)


Figures (9)

  • Figure 1: Decrease in coverage relative to baseline at varying slowdowns of the target program
  • Figure 2: Pair-wise Vargha-Delaney $\hat{A}_{12}$ effect size between edge-coverage of fuzzers. Bold values indicate significance at $p<0.05$.
  • Figure 3: Vargha-Delaney $\hat{A}_{12}$ effect size of edge-coverage between fuzzers. Bold values indicate significance at $p<0.05$.
  • Figure 4: Benchmark Programs
  • Figure 5: Multiple Linear Regression with LibFuzzer as reference level (Fuzzer Ranking $\sim$ Fuzzer $\times$ Properties) [Eqn. \ref{eqn:holistic}].
  • ...and 4 more figures
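
Figures 2 and 3 report the Vargha-Delaney $\hat{A}_{12}$ effect size between fuzzers. As a rough illustration of what that statistic measures, the sketch below computes it from its standard definition: the probability that a random observation from the first sample exceeds one from the second, with ties counted as one half. The function name and the coverage numbers are hypothetical and are not taken from the paper. The significance test behind the bold values in the figures is not covered here.

```python
from itertools import product


def vargha_delaney_a12(xs, ys):
    """Estimate P(X > Y) + 0.5 * P(X == Y) over all pairs drawn from xs and ys."""
    if not xs or not ys:
        raise ValueError("both samples must be non-empty")
    # Score each pair: 1 for a win of xs, 0.5 for a tie, 0 for a loss.
    score = sum(
        1.0 if x > y else 0.5 if x == y else 0.0
        for x, y in product(xs, ys)
    )
    return score / (len(xs) * len(ys))


# Hypothetical edge-coverage counts from five trials per fuzzer
# (illustrative values only, not data from the paper).
fuzzer_a = [1200, 1250, 1300, 1275, 1220]
fuzzer_b = [1100, 1260, 1150, 1180, 1210]

a12 = vargha_delaney_a12(fuzzer_a, fuzzer_b)
print(f"A12(fuzzer_a, fuzzer_b) = {a12:.2f}")
```

A value of 0.5 means neither fuzzer tends to achieve higher coverage; values near 1 favour the first sample and values near 0 favour the second.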