How Effective Are They? Exploring Large Language Model Based Fuzz Driver Generation

Cen Zhang; Yaowen Zheng; Mingqiang Bai; Yeting Li; Wei Ma; Xiaofei Xie; Yuekang Li; Limin Sun; Yang Liu

TL;DR

This work studies how well large language models can generate effective fuzz drivers for library APIs. It systematizes six prompting strategies and evaluates them across five LLMs and five temperature settings on 86 fuzz driver generation questions from OSS-Fuzz projects, with large-scale generation and validation. The results show strong potential: the best configuration solves about 78 of the 86 questions (roughly 91%). The study also identifies key obstacles, namely high token costs, the semantic correctness of API usage, and API dependencies that require richer contextual setup. LLM-generated drivers can achieve fuzzing outcomes comparable to the OSS-Fuzz drivers used in industry, and the study suggests concrete improvements (e.g., extending the contained API usage, adding semantic oracles) as well as integration with industry workflows such as OSS-Fuzz-Gen. The findings offer practical guidance for deploying LLM-assisted fuzz driver generation and point to ways of reducing cost and increasing automation in real-world fuzz testing.

Abstract

Large Language Model (LLM) based fuzz driver generation is a promising research area. Unlike traditional program-analysis-based methods, this text-based approach is more general and can harness a variety of API usage information, resulting in code that is friendly to human readers. However, the fundamental issues in this direction, such as its effectiveness and potential challenges, are still poorly understood. To bridge this gap, we conducted the first in-depth study targeting the important issues of using LLMs to generate effective fuzz drivers. Our study features a curated dataset with 86 fuzz driver generation questions from 30 widely used C projects. Six prompting strategies are designed and tested across five state-of-the-art LLMs with five different temperature settings. In total, our study evaluated 736,430 generated fuzz drivers, consuming 0.85 billion tokens (over $8,000 in token charges). Additionally, we compared the LLM-generated drivers against those used in industry, conducting extensive fuzzing experiments (3.75 CPU-years). Our study uncovered that:
- While LLM-based fuzz driver generation is a promising direction, it still encounters several obstacles to practical application;
- LLMs have difficulty generating effective fuzz drivers for APIs with intricate specifics. Three featured design choices of prompting strategies can be beneficial: issuing repeated queries, querying with examples, and employing an iterative querying process;
- While LLM-generated drivers can yield fuzzing outcomes that are on par with those used in industry, there are substantial opportunities for enhancement, such as extending the contained API usage or integrating semantic oracles to facilitate logical bug detection.
Our insights have been implemented to improve the OSS-Fuzz-Gen project, facilitating practical fuzz driver generation in industry.
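The three design choices named in the abstract (repeated queries, querying with examples, and iterative querying) can be pictured with a minimal Python sketch. This is an illustration under stated assumptions, not the study's implementation: `generate_driver`, `query_model`, `validate`, and the stub functions are hypothetical names, and the prompt wording is invented for the example.

```python
from typing import Callable, Optional


def generate_driver(
    query_model: Callable[[str], str],        # hypothetical: prompt -> driver source
    validate: Callable[[str], Optional[str]], # hypothetical: source -> error text, or None if valid
    base_prompt: str,
    repeats: int = 3,
    max_fix_rounds: int = 2,
) -> Optional[str]:
    """Return the first candidate driver that passes validation, or None."""
    for _ in range(repeats):                  # repeated queries: start over from the base prompt
        prompt = base_prompt
        for _ in range(max_fix_rounds + 1):   # iterative querying: feed errors back to the model
            candidate = query_model(prompt)
            error = validate(candidate)
            if error is None:
                return candidate
            prompt = (
                f"{base_prompt}\n\nPrevious attempt:\n{candidate}\n\n"
                f"Validation error:\n{error}\nPlease fix the driver."
            )
    return None


# Stubs standing in for a real model and a real compile-and-run validator.
def fake_model(prompt: str) -> str:
    if "Validation error" in prompt:
        return "int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) { return 0; }"
    return "int main(void) { return 0; }"


def fake_validate(code: str) -> Optional[str]:
    if "LLVMFuzzerTestOneInput" in code:
        return None
    return "missing LLVMFuzzerTestOneInput entry point"


# Querying with examples: the base prompt carries a usage snippet for the target API.
driver = generate_driver(
    fake_model,
    fake_validate,
    "Write a fuzz driver for the API foo(). Example usage: foo(buf, len);",
)
```

In this sketch the outer loop restarts from the unmodified prompt, while the inner loop appends the validator's error message so the next query can repair the previous attempt. A real validator would compile and briefly run each candidate rather than search for a string.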
