RandSet: Randomized Corpus Reduction for Fuzzing Seed Scheduling

Yuchong Xie, Kaikai Zhang, Yu Liu, Rundong Yang, Ping Chen, Shuai Wang, Dongdong She

TL;DR

This work proposes RandSet, a randomized corpus reduction technique that shrinks the corpus and diversifies seed selection at the same time, with minimal overhead. It introduces randomness into corpus reduction to gain the two benefits of a randomized algorithm: randomized output (diverse seed selection) and low runtime cost.

Abstract

Seed explosion is a fundamental problem in fuzzing seed scheduling, where a fuzzer maintains a huge corpus and fails to choose promising seeds. Existing works focus on seed prioritization but still suffer from seed explosion because the corpus remains huge. We tackle this from a new perspective: corpus reduction, i.e., computing a subset of the seed corpus. However, corpus reduction can lead to poor seed diversity and large runtime overhead. Prior techniques such as cull_queue, AFL-Cmin, and MinSet suffer from poor diversity or prohibitive overhead, making them unsuitable for high-frequency seed scheduling. We propose RandSet, a novel randomized corpus reduction technique that reduces corpus size and yields diverse seed selection simultaneously with minimal overhead. Our key insight is to introduce randomness into corpus reduction to gain two benefits of a randomized algorithm: randomized output (diverse seed selection) and low runtime cost. Specifically, we formulate corpus reduction as a set cover problem and compute a randomized subset covering all features of the entire corpus. We then schedule seeds from this small, randomized subset rather than the entire corpus, effectively mitigating seed explosion. We implement RandSet on three popular fuzzers (AFL++, LibAFL, and Centipede) and evaluate it on standalone programs, FuzzBench, and Magma. Results show RandSet achieves significantly more diverse seed selection than other reduction techniques, with average subset ratios of 4.03% and 5.99% on standalone and FuzzBench programs, respectively. RandSet achieves a 16.58% coverage gain on standalone programs and up to 3.57% on FuzzBench in AFL++, and triggers up to 7 more ground-truth bugs than the state of the art on Magma, while introducing only 1.17%-3.93% overhead.
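
The set cover formulation in the abstract can be illustrated with a small sketch. The code below is not the authors' algorithm; it is one simple way to obtain a randomized cover, assuming the corpus is available as a mapping from seed ID to the set of features that seed covers. The function name `randomized_cover`, the dictionary layout, and the toy corpus are all hypothetical. Seeds are visited in a random order and kept only if they add a feature not yet covered, so every run returns a subset that covers all features, while different runs may return different subsets.

```python
import random


def randomized_cover(corpus, rng):
    """Return a random subset of seed IDs that covers every feature in corpus.

    corpus: dict mapping seed ID -> set of feature IDs (assumed layout).
    rng:    a random.Random instance, so runs can be made reproducible.
    """
    order = list(corpus)
    rng.shuffle(order)           # random visiting order randomizes the output
    covered = set()
    subset = []
    for seed in order:
        new_features = corpus[seed] - covered
        if new_features:         # keep a seed only if it adds uncovered features
            subset.append(seed)
            covered |= new_features
    return subset


# Toy corpus: 6 seeds covering 4 features, with deliberate redundancy.
corpus = {
    1: {"a", "b"},
    2: {"c"},
    3: {"d"},
    4: {"a"},
    5: {"b", "c"},
    6: {"c", "d"},
}
all_features = set().union(*corpus.values())

for run in range(3):
    subset = randomized_cover(corpus, random.Random(run))
    # Every run must preserve the feature coverage of the full corpus.
    assert set().union(*(corpus[s] for s in subset)) == all_features
    print(f"run {run}: seeds {sorted(subset)}")
```

Because each kept seed contributes at least one new feature, the subset never holds more seeds than there are features, and a single pass over the corpus is enough. The result is a valid cover but not necessarily a minimum one.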

Paper Structure

This paper contains 35 sections, 3 equations, 5 figures, 28 tables, 2 algorithms.

Figures (5)

  • Figure 1: An example of our seed corpus reduction. We formulate the corpus reduction as a set cover problem. The seed corpus with redundant seeds (6 seeds) is reduced into a distilled corpus (3 seeds). The distilled corpus preserves all feature coverage (4 colors) with minimal redundancy. Each vertical bar represents a seed and each colored block indicates a unique feature covered by the seeds.
  • Figure 2: Diversity comparison between deterministic and randomized corpus reduction algorithms. We show the selection results from a seed corpus of 6 seeds under different reduction algorithms. Deterministic corpus reduction algorithms, i.e., MinSet, AFL-Cmin, and cull_queue (left), consistently select the same three seeds (1, 2, 3) across multiple runs, while our randomized set cover algorithm (right) produces diverse selections: seeds (1, 2, 3) in the first run, seeds (4, 5, 6) in the second run, and seeds (1, 2, 4) in the third run. Each numbered circle represents a seed.
  • Figure 3: Cumulative frequency distributions of seeds for RandSet against four baselines on FuzzBench programs over a 24-hour fuzzing campaign. The x-axis shows seed IDs sorted in descending order of selection frequency; the y-axis shows the cumulative probability of seed selection.
  • Figure 4: Cumulative frequency distributions of seeds for RandSet against four baselines on standalone programs over a 24-hour fuzzing campaign. The x-axis shows seed IDs sorted in descending order of selection frequency; the y-axis shows the cumulative probability of seed selection.
  • Figure 5: Cumulative frequency distributions of seeds for RandSet against three variants on FuzzBench programs over a 1-hour fuzzing campaign. The x-axis shows seed IDs sorted in descending order of selection frequency; the y-axis shows the cumulative probability of seed selection.
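
The cumulative frequency curves described for Figures 3 to 5 can be computed from a log of selected seed IDs. The sketch below assumes such a log is available as a plain list; the function name `cumulative_selection_curve` and the two example logs are hypothetical and are not taken from the paper's evaluation. Seeds are ranked by descending selection count, and the curve gives the fraction of all selections accounted for by the top-ranked seeds.

```python
from collections import Counter
from itertools import accumulate


def cumulative_selection_curve(selections):
    """Return cumulative selection probabilities, most-selected seed first.

    selections: iterable of seed IDs, one entry per scheduling decision
                (assumed log format).
    """
    counts = Counter(selections)
    freqs = sorted(counts.values(), reverse=True)   # descending frequency
    total = sum(freqs)
    return [running / total for running in accumulate(freqs)]


# Made-up logs: one scheduler keeps picking seed 1, the other spreads evenly.
skewed_log = [1] * 8 + [2, 3]
even_log = [1, 2, 3, 4, 5] * 2

print(cumulative_selection_curve(skewed_log))
print(cumulative_selection_curve(even_log))
```

A curve that jumps close to 1 after the first few seeds means selections are concentrated on those seeds; a curve that rises slowly means selections are spread over more seeds.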