PBFuzz: Agentic Directed Fuzzing for PoV Generation

Haochen Zeng, Andrew Bao, Jiajun Cheng, Chengyu Song

TL;DR

PBFuzz presents an agentic framework that leverages LLM-driven planning, customized tool orchestration, persistent memory, and property-based testing to generate proof-of-vulnerability inputs. By formalizing PoV generation as a Plan-Implement-Execute-Reflect cycle, the system extracts semantic reachability and triggering constraints, encodes them into typed input spaces, and solves them efficiently with a two-stage, constraint-guided fuzzing process. Evaluated on the Magma benchmark, PBFuzz triggers 57 CVEs (including 17 that no baseline fuzzer exposed) within a 30-minute budget per target, substantially outperforming state-of-the-art baselines in coverage, speed, and consistency, with an average cost of $1.83 per PoV. The work demonstrates that integrating semantic reasoning, structured memory, and property-based constraint solving can overcome the inefficiencies and constraint-satisfaction challenges of traditional fuzzing and one-shot LLM approaches, paving the way for more effective automated security testing pipelines.

Abstract

Proof-of-Vulnerability (PoV) input generation is a critical task in software security and supports downstream applications such as path generation and validation. Generating a PoV input requires solving two sets of constraints: (1) reachability constraints for reaching vulnerable code locations, and (2) triggering constraints for activating the target vulnerability. Existing approaches, including directed greybox fuzzing and LLM-assisted fuzzing, struggle to efficiently satisfy these constraints. This work presents an agentic method that mimics human experts. Human analysts iteratively study code to extract semantic reachability and triggering constraints, form hypotheses about PoV triggering strategies, encode them as test inputs, and refine their understanding using debugging feedback. We automate this process with an agentic directed fuzzing framework called PBFuzz. PBFuzz tackles four challenges in agentic PoV generation: autonomous code reasoning for semantic constraint extraction, custom program-analysis tools for targeted inference, persistent memory to avoid hypothesis drift, and property-based testing for efficient constraint solving while preserving input structure. Experiments on the Magma benchmark show strong results. PBFuzz triggered 57 vulnerabilities, surpassing all baselines, and uniquely triggered 17 vulnerabilities not exposed by existing fuzzers. PBFuzz achieved this within a 30-minute budget per target, while conventional approaches use 24 hours. Median time-to-exposure was 339 seconds for PBFuzz versus 8680 seconds for AFL++ with CmpLog, giving a 25.6x efficiency improvement with an API cost of 1.83 USD per vulnerability.
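
To make the split between reachability and triggering constraints concrete, the following is a minimal sketch of a two-stage, property-based search over a typed input space. It is not the PBFuzz implementation: the toy target, its `PB` header, the record kind `0x07`, the 16-byte capacity, and every function name here are assumptions invented for illustration. Stage 1 iterates a few concrete parameter values, and Stage 2 falls back to random sampling of the parameter space while keeping the input structurally valid.

```python
import itertools
import random

BUF_CAP = 16  # hypothetical buffer capacity inside the toy target


def toy_target(data: bytes) -> str:
    """Hypothetical target: returns 'rejected', 'reached', or 'triggered'."""
    if len(data) < 4 or data[:2] != b"PB":
        return "rejected"  # reachability: magic header must match
    kind, declared_len = data[2], data[3]
    if kind != 0x07:
        return "rejected"  # reachability: only one record kind is parsed
    if declared_len != len(data) - 4:
        return "rejected"  # structure: length field must match the payload
    if declared_len > BUF_CAP:
        return "triggered"  # triggering: payload exceeds the buffer
    return "reached"


def build_input(kind: int, payload_len: int) -> bytes:
    """Encode typed parameters into a structurally valid input."""
    return b"PB" + bytes([kind, payload_len]) + b"A" * payload_len


def search(seed: int = 0, samples: int = 200):
    """Return (stage, input) for the first triggering input, or None."""
    # Stage 1: iterate a small grid of concrete parameter values.
    reaching_kinds = set()
    for kind, n in itertools.product((0x01, 0x07), (0, 1, BUF_CAP)):
        data = build_input(kind, n)
        outcome = toy_target(data)
        if outcome == "triggered":
            return 1, data
        if outcome == "reached":
            reaching_kinds.add(kind)

    # Stage 2: sample the parameter space, reusing kinds known to reach.
    rng = random.Random(seed)
    kinds = sorted(reaching_kinds) or list(range(256))
    for _ in range(samples):
        data = build_input(rng.choice(kinds), rng.randint(0, 255))
        if toy_target(data) == "triggered":
            return 2, data
    return None


if __name__ == "__main__":
    print(search())
```

Because the generator works on typed parameters rather than raw bytes, every candidate keeps the header and length field consistent, so the search budget is spent on the triggering condition rather than on inputs the parser would reject.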

Paper Structure

This paper contains 57 sections, 11 figures, 5 tables.

Figures (11)

  • Figure 1: CVE-2017-9047 buffer overflow in libxml2's xmlSnprintfElementContent() function. The bounds check uses the stale buffer length len instead of the updated strlen(buf) after appending namespace prefixes, allowing writes beyond the allocated memory.
  • Figure 2: PBFuzz Architecture Overview.
  • Figure 3: Property-based fuzzing workflow with two-stage test generation: Stage 1 iterates concrete parameters, Stage 2 performs heuristic parameter space sampling.
  • Figure 4: Static analysis workflow for call graph analysis and deviation detection.
  • Figure 5: Effectiveness across Magma benchmarks ordered by performance. Top: total unique vulnerabilities triggered. Middle: number of vulnerabilities exclusively triggered by PBFuzz compared to other fuzzers. Bottom: compared to PBFuzz, number of vulnerabilities only discovered by baselines.
  • ...and 6 more figures
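
The stale-length pattern described in the Figure 1 caption can be modeled in a few lines. The sketch below is a simplified Python model, not the libxml2 source: the function name `append_name`, its parameters, and the `stale` switch are hypothetical, and since a `bytearray` grows instead of overflowing, exceeding `size` stands in for an out-of-bounds write.

```python
def append_name(buf: bytearray, size: int, prefix: bytes, name: bytes,
                stale: bool) -> None:
    """Append 'prefix:name' to buf, guarded by bounds checks against size.

    With stale=True the second check reuses the length measured before the
    prefix was appended, which models the flaw in the caption.
    """
    length = len(buf)  # measured once, before any append
    if size - length < len(prefix) + 1:
        return  # not enough room for the prefix and the colon
    buf.extend(prefix + b":")  # buffer grows here
    if not stale:
        length = len(buf)  # corrected variant: re-measure after the append
    if size - length < len(name):
        return  # not enough room for the name
    buf.extend(name)


def overflows(stale: bool, size: int = 16) -> bool:
    """Report whether the modeled buffer ends up larger than its capacity."""
    buf = bytearray()
    append_name(buf, size, b"a" * 10, b"b" * 10, stale)
    return len(buf) > size


if __name__ == "__main__":
    print("stale check overflows:", overflows(stale=True))
    print("fresh check overflows:", overflows(stale=False))
```

In this model the triggering constraint is that the prefix and the name each pass their check individually while their combined length exceeds the capacity, which is the kind of semantic condition that is hard to satisfy by random mutation alone.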