Agentic Property-Based Testing: Finding Bugs Across the Python Ecosystem
Muhammad Maaz, Liam DeVoe, Zac Hatfield-Dodds, Nicholas Carlini
TL;DR
This paper introduces agentic property-based testing, an LLM-driven agent that autonomously analyzes Python codebases to infer both function-specific and cross-function properties, generates Hypothesis-based tests, executes them, and produces actionable bug reports. In a large-scale evaluation across 100 Python packages, the approach uncovers genuine bugs and leads to patches merged into real projects, illustrating the practical viability of autonomous PBT at scale. The agent follows a six-step prompt workflow (analyze, understand, propose properties, write tests, execute/triage, report) implemented atop Claude Code, and its bug-finding ability is quantified through manual validation of its reports and through cost metrics. Although the manual review is incomplete and the intended behavior of some code is ambiguous, the results position LLM-guided PBT as a rigorous, scalable complement to traditional software testing, capable of surfacing issues across diverse libraries.
Abstract
Property-based testing (PBT) is a lightweight formal method, typically implemented as a randomized testing framework. Users specify the input domain for their test using combinators supplied by the PBT framework, and express the expected properties or invariants as a unit-test function. The framework then searches for a counterexample, e.g. by generating inputs and calling the test function. In this work, we demonstrate an LLM-based agent that analyzes Python modules, infers function-specific and cross-function properties from code and documentation, synthesizes and executes PBTs, reflects on the outputs of these tests to confirm true bugs, and finally outputs actionable bug reports for the developer. We perform an extensive evaluation of our agent across 100 popular Python packages. Of the bug reports generated by the agent, we found after manual review that 56% were valid bugs and 32% were valid bugs that we would report to maintainers. We then developed a ranking rubric to surface high-priority valid bugs to developers, and found that of the 21 top-scoring bugs, 86% were valid and 81% we would report. The bugs span diverse failure modes, from serialization failures to numerical precision errors to flawed cache implementations. We reported 5 bugs, 4 with patches, including to NumPy and cloud computing SDKs, with 3 patches merged successfully. Our results suggest that LLMs combined with PBT provide a rigorous and scalable method for autonomously testing software. Our code and artifacts are available at: https://github.com/mmaaz-git/agentic-pbt.
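To make the PBT loop described in the abstract concrete, the following is a minimal sketch using only the Python standard library. It is a toy stand-in, not Hypothesis's API and not the agent's generated code: the generator `lists_of_ints`, the function under test `buggy_dedupe`, and the driver `find_counterexample` are all hypothetical names introduced for illustration. It shows the three ingredients named above: an input domain, a property written as a function, and a randomized search for a counterexample.

```python
import random


def lists_of_ints(rng, max_len=8, lo=-100, hi=100):
    """Toy input-domain generator (hypothetical): a random list of integers."""
    return [rng.randint(lo, hi) for _ in range(rng.randint(0, max_len))]


def buggy_dedupe(xs):
    """Hypothetical function under test: should drop duplicates, keeping order."""
    return sorted(set(xs))  # bug: sorting discards the original order


def prop_keeps_first_occurrence_order(xs):
    """Property: the result is the first occurrence of each element, in order."""
    expected = list(dict.fromkeys(xs))  # dicts preserve insertion order
    return buggy_dedupe(xs) == expected


def find_counterexample(prop, gen, trials=1000, seed=0):
    """Randomized search: return the first generated input that falsifies prop."""
    rng = random.Random(seed)  # fixed seed keeps the run reproducible
    for _ in range(trials):
        candidate = gen(rng)
        if not prop(candidate):
            return candidate
    return None  # no counterexample found within the trial budget


counterexample = find_counterexample(prop_keeps_first_occurrence_order, lists_of_ints)
print("counterexample:", counterexample)
```

A real PBT framework such as Hypothesis adds a good deal on top of this loop, including composable strategies for describing inputs and shrinking of failing inputs to smaller ones; the sketch omits both and only illustrates the generate-and-check core.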
