
SAFuzz: Semantic-Guided Adaptive Fuzzing for LLM-Generated Code

Ziyi Yang, Kalit Inani, Keshav Kabra, Vima Gupta, Anand Padmanabha Iyer

TL;DR

SAFuzz addresses the challenge of testing AI-generated code at scale by introducing semantic-guided adaptive fuzzing. It uses three core components: prompt variant generation to capture prompt-induced diversity, an LLM-driven fuzz-harness generator with semantic oracles, and a vulnerability predictor that allocates fuzzing resources adaptively. On 96 CSES algorithmic problems, SAFuzz improves vulnerability discrimination precision from $77.9\%$ to $85.7\%$, reduces total fuzzing time by $1.71\times$, and, when combined with unit test generation, increases bug-detection recall from $67.3\%$ to $79.5\%$. These results demonstrate substantial efficiency gains and complementary strengths between fuzzing and unit testing, offering a scalable path for robust AI-generated code safety.

Abstract

While AI-coding assistants accelerate software development, current testing frameworks struggle to keep pace with the resulting volume of AI-generated code. Traditional fuzzing techniques often allocate resources uniformly and lack semantic awareness of algorithmic vulnerability patterns, leading to inefficient resource usage and missed vulnerabilities. To address these limitations, we present a hybrid testing framework that leverages LLM-guided adaptive fuzzing to detect algorithmic vulnerabilities efficiently. Our system SAFuzz integrates prompt-based behavioral diversification, harness generation with problem-specific oracles, and an LLM-based predictor to enable adaptive resource allocation and dynamic early stopping. Evaluating SAFuzz on CSES algorithmic problems, we improve vulnerability discrimination precision from 77.9% to 85.7% and achieve a 1.71x reduction in time cost compared to SOTA GreenFuzz while maintaining comparable recall. We further observe that combining our approach with existing unit test generation methods yields complementary gains, increasing the bug detection recall from 67.3% to 79.5%.
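
The adaptive resource allocation and dynamic early stopping described above can be illustrated with a minimal sketch. This is not the SAFuzz implementation: the `Target` class, the risk-proportional split, the `threshold`, `batch_seconds`, and `patience` parameters, and the `run_batch` callback are all hypothetical names and choices made for illustration. The sketch assumes a predictor has already assigned each program a risk score in [0, 1].

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Target:
    name: str
    risk: float  # hypothetical predicted vulnerability score in [0, 1]


def allocate_budgets(
    targets: List[Target], total_seconds: float, threshold: float = 0.3
) -> Dict[str, float]:
    """Drop targets scoring below `threshold`, then split the total
    fuzzing time among the rest in proportion to their risk.
    (Proportional split is an assumption, not the paper's formula.)"""
    kept = [t for t in targets if t.risk >= threshold]
    total_risk = sum(t.risk for t in kept)
    if total_risk == 0:
        return {}
    return {t.name: total_seconds * t.risk / total_risk for t in kept}


def fuzz_with_early_stop(
    run_batch: Callable[[], int],
    budget_seconds: float,
    batch_seconds: float = 10.0,
    patience: int = 3,
) -> Tuple[int, float]:
    """Fuzz in fixed-size batches. `run_batch` is a hypothetical callback
    returning the number of new findings in one batch. Stop when the
    budget is exhausted or after `patience` consecutive empty batches."""
    spent = 0.0
    idle = 0
    findings = 0
    while spent + batch_seconds <= budget_seconds and idle < patience:
        new = run_batch()
        spent += batch_seconds
        findings += new
        idle = 0 if new else idle + 1
    return findings, spent
```

Under these assumptions, a low-risk program receives no fuzzing time at all, and a high-risk program that stops producing findings returns its unused budget early, which is the kind of saving the abstract attributes to adaptive allocation.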

Paper Structure (19 sections, 1 equation, 6 figures, 4 tables)

Figures (6)

  • Figure 1: Limitations of ChatUniTest. Left: Validity rate of the generated unit tests across the problems. Right: Bug detection rate across the different bug categories.
  • Figure 2: Overview of SAFuzz. This pipeline has three stages. Stage 1 generates diverse prompt variations to simulate real-world user interactions. Stage 2 involves generation of problem-specific fuzz harnesses with semantic oracles. In Stage 3, a hybrid vulnerability predictor analyzes static and LLM-guided code features to estimate risk, filters out non-vulnerable code and intelligently allocates fuzzing time budgets. Finally, the remaining programs are fuzzed against their harnesses according to their allocated budget. Overall, SAFuzz prioritizes fuzzing high-risk targets to efficiently detect vulnerabilities.
  • Figure 3: Precision comparison across filtering thresholds. SAFuzz maintains higher precision across the threshold range.
  • Figure 4: Comparison of discrimination capability across thresholds. SAFuzz achieves a superior trade-off by filtering significantly more non-vulnerable code (x-axis) while evicting fewer vulnerable targets (y-axis) than GreenFuzz.
  • Figure 5: Tradeoff between time budget and bugs caught
  • ...and 1 more figures
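
The quantities plotted in Figures 3 and 4 can be made concrete with a small sketch. It assumes that a program is kept for fuzzing when its predicted risk score is at or above a threshold, and that precision is the fraction of kept programs that are truly vulnerable. These definitions, the function name `filter_metrics`, and the toy data in the usage note are illustrative assumptions, not taken from the paper.

```python
from typing import Dict, List


def filter_metrics(
    scores: List[float], labels: List[bool], threshold: float
) -> Dict[str, float]:
    """Evaluate a risk-score filter at one threshold.

    scores: hypothetical predicted risk per program.
    labels: True if the program is actually vulnerable.
    """
    pairs = list(zip(scores, labels))
    kept = [lab for s, lab in pairs if s >= threshold]
    dropped = [lab for s, lab in pairs if s < threshold]

    true_pos = sum(kept)          # vulnerable programs kept for fuzzing
    total_vuln = sum(labels)      # all vulnerable programs

    return {
        # Fraction of kept programs that are truly vulnerable.
        "precision": true_pos / len(kept) if kept else 0.0,
        # Fraction of vulnerable programs that survive the filter.
        "recall": true_pos / total_vuln if total_vuln else 0.0,
        # Non-vulnerable programs filtered out (Figure 4, x-axis).
        "safe_filtered": float(sum(1 for lab in dropped if not lab)),
        # Vulnerable programs wrongly evicted (Figure 4, y-axis).
        "vulnerable_evicted": float(sum(dropped)),
    }
```

Calling `filter_metrics([0.9, 0.8, 0.4, 0.2, 0.1], [True, False, True, False, True], 0.5)` on this toy data keeps two programs, one of them vulnerable, so precision is 0.5 while two vulnerable programs are evicted. Sweeping the threshold and recording these values would trace curves of the kind the two figures compare.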