SeedAIchemy: LLM-Driven Seed Corpus Generation for Fuzzing

Aidan Wen, Norah A. Alzahrani, Jingzhi Jiang, Andrew Joe, Karen Shieh, Andy Zhang, Basel Alomair, David Wagner

TL;DR

SeedAIchemy tackles a bottleneck in fuzzing adoption by automating seed-corpus construction with LLM-driven search-term generation across GitHub, web sources, feature-focused queries, bug trackers, and Common Crawl. It runs five modules in parallel, combines and deduplicates their output, and minimizes the resulting corpus with afl-cmin to improve fuzzing efficiency. In Magma-based experiments, SeedAIchemy delivers corpus quality close to that of manually curated sets and outperforms naive and G$^2$FUZZ corpora in bugs reached, bugs triggered, and code coverage, while requiring no manual curation. This approach substantially lowers the cost and expertise needed to use fuzzing effectively in real-world development, broadening its practical impact.
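The combine-and-deduplicate step described above can be illustrated with a minimal sketch. This is not SeedAIchemy's actual implementation: the function name `merge_and_dedupe`, the directory layout, and the choice of SHA-256 content hashing are all assumptions made for illustration.

```python
import hashlib
import shutil
from pathlib import Path


def merge_and_dedupe(module_dirs, out_dir):
    """Copy files from each module's output directory into out_dir,
    keeping one copy per distinct byte content (hypothetical helper)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    seen = set()   # content hashes already copied
    kept = []      # destination paths of the files we kept
    for module_dir in module_dirs:
        for path in sorted(Path(module_dir).rglob("*")):
            if not path.is_file():
                continue
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            if digest in seen:
                continue  # exact duplicate of a file already kept
            seen.add(digest)
            # Name each file by its hash to avoid name collisions across modules.
            dest = out / (digest + path.suffix)
            shutil.copyfile(path, dest)
            kept.append(dest)
    return kept
```

The merged directory could then be handed to afl-cmin, for example `afl-cmin -i merged -o minimized -- ./target @@`, where `./target` stands in for an instrumented fuzz target binary that is not named in the source.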

Abstract

We introduce SeedAIchemy, an automated LLM-driven corpus generation tool that makes it easier for developers to implement fuzzing effectively. SeedAIchemy consists of five modules that implement different approaches to collecting publicly available files from the internet. Four of the five modules use large language model (LLM) workflows to construct search terms designed to maximize corpus quality. Corpora generated by SeedAIchemy perform significantly better than a naive corpus and similarly to a manually curated corpus on a diverse range of target programs and libraries.

Paper Structure

This paper contains 23 sections, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Architecture of SeedAIchemy. SeedAIchemy combines corpora from each submodule, then applies minimization techniques to reduce the size of the final corpus.
  • Figure 2: Example queries and outputs of a single run of SeedAIchemy for the JPG datatype. To preserve space, LLM prompts are simplified versions of the real ones used in SeedAIchemy. Only modules that used an LLM to generate search queries are shown.
  • Figure 3: Bugs reached, bugs triggered, and normalized code coverage averaged over 10 trials. Error bands show the 95% confidence interval. Coverage is normalized by the 24-hour coverage of each Magma fuzz target.
  • Figure 4: Bugs reached, bugs triggered, and code coverage for each target averaged over 10 trials. Error bars show the 95% confidence interval.
  • Figure 5: Time to trigger bug averaged across 10 trials.
  • ...and 3 more figures