SoK: Prudent Evaluation Practices for Fuzzing

Moritz Schloegel, Nils Bars, Nico Schiller, Lukas Bernhard, Tobias Scharnowski, Addison Crump, Arash Ale Ebrahim, Nicolai Bissantz, Marius Muench, Thorsten Holz

TL;DR

This SoK analyzes how fuzzing papers (2018–2023) evaluate their results, highlighting widespread gaps in reproducibility, target selection, benchmark use, and statistical rigor. It systematically reviews 150 papers, annotates eight case studies through artifact reproduction, and shows that many evaluations rely on incomplete baselines, inconsistent runtimes, and questionable metrics such as unique crashes or CVEs. The authors synthesize revised best practices to improve reproducibility, including open artifacts, representative targets, fair and documented setups, robust statistics, and transparent reporting of threats to validity. The work has practical impact by providing a concrete checklist and public guidelines to elevate the methodological quality of fuzzing research and its real-world applicability.

Abstract

Fuzzing has proven to be a highly effective approach to uncover software bugs over the past decade. After AFL popularized the groundbreaking concept of lightweight coverage feedback, the field of fuzzing has seen a vast amount of scientific work proposing new techniques, improving methodological aspects of existing strategies, or porting existing methods to new domains. All such work must demonstrate its merit by showing its applicability to a problem, measuring its performance, and often showing its superiority over existing works in a thorough, empirical evaluation. Yet, fuzzing is highly sensitive to its target, environment, and circumstances, e.g., randomness in the testing process. After all, relying on randomness is one of the core principles of fuzzing, governing many aspects of a fuzzer's behavior. Combined with the often highly difficult to control environment, the reproducibility of experiments is a crucial concern and requires a prudent evaluation setup. To address these threats to validity, several works, most notably Evaluating Fuzz Testing by Klees et al., have outlined how a carefully designed evaluation setup should be implemented, but it remains unknown to what extent their recommendations have been adopted in practice. In this work, we systematically analyze the evaluation of 150 fuzzing papers published at the top venues between 2018 and 2023. We study how existing guidelines are implemented and observe potential shortcomings and pitfalls. We find a surprising disregard of the existing guidelines regarding statistical tests and systematic errors in fuzzing evaluations. For example, when investigating reported bugs, ...

Paper Structure

This paper contains 75 sections, 1 equation, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Benchmark usage over the years. The numbers in brackets represent the number of papers analyzed for the respective year. Note that some papers use multiple benchmarks, hence the numbers do not add up.
  • Figure 2: Distribution of runtimes used in practice and cumulative distribution function (CDF), which shows that 27% of papers use a runtime of less than 23 hours. 26 papers use multiple, different runtimes; we include all in these cases.
  • Figure 3: Outcome of 339 CVEs that were reported across 35 papers. Only 43% of the CVEs have been acknowledged by the developers. Information on CVEs in the Reserved state is withheld pending public disclosure.
  • Figure 4: Distribution of trials used in practice and cumulative distribution function (CDF). 8 papers use a different number of trials for different experiments; we include all numbers in this case. A further 21 papers fail to specify the number of trials.
  • Figure 5: The upper two graphs, published in the EcoFuzz paper (Yue et al., 2020), show a strong advantage over all competitors on a non-standard metric: the number of total paths over the number of total executions. The two plots at the bottom compare EcoFuzz on the standard metric, branch coverage over time. On the standard metric, EcoFuzz performs significantly worse.
  • ...and 1 more figure
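
The abstract's point about statistical tests, and the trial counts summarized in Figure 4, can be made concrete with a small sketch. The following is an illustration only, not code from the paper or its artifact: it compares two fuzzers over repeated trials using the Mann-Whitney U statistic and the Vargha-Delaney A12 effect size, and it approximates the p-value with a seeded permutation test rather than an exact table. The function names, the coverage numbers, and the choice of 10 trials are all hypothetical.

```python
import random
from itertools import product


def u_statistic(xs, ys):
    """Mann-Whitney U for xs: pairs (x, y) with x > y count 1, ties count 0.5."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0 for x, y in product(xs, ys))


def a12(xs, ys):
    """Vargha-Delaney A12: chance that a random trial of xs beats one of ys."""
    return u_statistic(xs, ys) / (len(xs) * len(ys))


def permutation_p_value(xs, ys, rounds=10_000, seed=0):
    """Two-sided permutation p-value for U (seeded so reruns give the same value)."""
    rng = random.Random(seed)
    expected = len(xs) * len(ys) / 2  # value of U when neither fuzzer is better
    observed = abs(u_statistic(xs, ys) - expected)
    pooled = list(xs) + list(ys)
    hits = 0
    for _ in range(rounds):
        rng.shuffle(pooled)
        perm_xs, perm_ys = pooled[:len(xs)], pooled[len(xs):]
        if abs(u_statistic(perm_xs, perm_ys) - expected) >= observed:
            hits += 1
    # Add-one correction avoids reporting an impossible p = 0.
    return (hits + 1) / (rounds + 1)


# Hypothetical final branch coverage from 10 trials per fuzzer (made-up numbers).
fuzzer_a = [1510, 1498, 1533, 1520, 1507, 1541, 1515, 1502, 1528, 1519]
fuzzer_b = [1480, 1495, 1472, 1501, 1488, 1469, 1490, 1478, 1485, 1493]

p = permutation_p_value(fuzzer_a, fuzzer_b)
print(f"A12 = {a12(fuzzer_a, fuzzer_b):.2f}, p = {p:.4f}")
```

Reporting an effect size next to the p-value shows how large the difference is, not only whether it is likely to be real. With few trials, even a large difference may fail to reach significance, which is one reason the number of trials matters.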