TestExplora: Benchmarking LLMs for Proactive Bug Discovery via Repository-Level Test Generation

Steven Liu, Jane Luo, Xin Zhang, Aofan Liu, Hao Liu, Jie Wu, Ziyang Huang, Yangyu Huang, Yu Kang, Scarlett Li

TL;DR

TestExplora introduces the first benchmark focused on proactive defect discovery in realistic repository environments. It uses documentation-derived intent as the oracle and evaluates LLMs on 2,389 tasks across 482 repositories. The paper describes a scalable acquisition and evaluation framework with time-aware data collection to reduce leakage, and defines four metrics (HP, F2P, EC, CFG) to assess test quality, bug discovery, and code coverage. Results reveal a substantial capability gap: state-of-the-art models reach a maximum F2P of only 16.06%. Agentic exploration (SWEAgent, Trae-Agent) shows promise, with SWEAgent instantiated with GPT-5-mini reaching 17.27% F2P and 29.7% F2P@5. The findings highlight the difficulty of cross-module interactions and the value of directed exploration, pointing toward autonomous software quality assurance with realistic, scalable benchmarks that reflect live repository dynamics.

Abstract

Given that Large Language Models (LLMs) are increasingly applied to automate software development, comprehensive software assurance spans three distinct goals: regression prevention, reactive reproduction, and proactive discovery. Current evaluations systematically overlook the third goal. Specifically, they either treat existing code as ground truth (a compliance trap) for regression prevention, or depend on post-failure artifacts (e.g., issue reports) for bug reproduction, so they rarely surface defects before failures occur. To bridge this gap, we present TestExplora, a benchmark designed to evaluate LLMs as proactive testers within full-scale, realistic repository environments. TestExplora contains 2,389 tasks from 482 repositories and hides all defect-related signals. Models must proactively find bugs by comparing implementations against documentation-derived intent, using documentation as the oracle. Furthermore, to keep evaluation sustainable and reduce leakage, we propose continuous, time-aware data collection. Our evaluation reveals a significant capability gap: state-of-the-art models achieve a maximum Fail-to-Pass (F2P) rate of only 16.06%. Further analysis indicates that navigating complex cross-module interactions and leveraging agentic exploration are critical to advancing LLMs toward autonomous software quality assurance. Consistent with this, SWEAgent instantiated with GPT-5-mini achieves an F2P of 17.27% and an F2P@5 of 29.7%, highlighting the effectiveness and promise of agentic exploration in proactive bug discovery tasks.
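
The abstract reports F2P and F2P@5 without defining them in this section. The sketch below shows one common reading, stated as an assumption: a generated test counts as Fail-to-Pass if it fails on the defective code and passes on the fixed code, and F2P@k counts a task as solved if any of its first k generated tests is Fail-to-Pass. The `Attempt` record and both function names are hypothetical and are not taken from the TestExplora codebase.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Attempt:
    """One generated test for a task (hypothetical record, not the benchmark's schema)."""
    passes_on_buggy: bool  # outcome when run against the defective code
    passes_on_fixed: bool  # outcome when run against the fixed code


def is_fail_to_pass(attempt: Attempt) -> bool:
    # Assumed definition: the test exposes the defect (fails before the fix)
    # and is itself valid (passes after the fix).
    return (not attempt.passes_on_buggy) and attempt.passes_on_fixed


def f2p_at_k(attempts_per_task: List[List[Attempt]], k: int) -> float:
    # Fraction of tasks where at least one of the first k attempts is Fail-to-Pass.
    # With k=1 this reduces to a single-attempt F2P rate.
    if not attempts_per_task:
        return 0.0
    solved = sum(
        any(is_fail_to_pass(a) for a in attempts[:k])
        for attempts in attempts_per_task
    )
    return solved / len(attempts_per_task)


# Illustrative data only: four tasks with two generated tests each.
tasks = [
    [Attempt(False, True), Attempt(True, True)],   # first test exposes the bug
    [Attempt(True, True), Attempt(False, True)],   # second test exposes the bug
    [Attempt(False, False), Attempt(True, True)],  # first test is simply broken
    [Attempt(True, True), Attempt(True, True)],    # both tests miss the bug
]
print(f2p_at_k(tasks, 1), f2p_at_k(tasks, 2))
```

Under this reading, a test that fails on both versions does not count, because it is indistinguishable from an invalid test. This is why the metric needs a run against the fixed code as well as the defective one.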

Paper Structure

This paper contains 13 sections, 6 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: The data acquisition process of TestExplora. Unlike SWTBench [swtbench], TestExplora relies on documentation rather than issue descriptions and requires models to proactively identify bugs that violate the documentation, instead of reactively reproducing known failures. Moreover, whereas SWTBench is derived from SWEBench [yang2024swe], the repositories collected for TestExplora are disjoint from those in SWEBench, and the collection is designed to be extensible.
  • Figure 2: Statistics of TestExplora. Categories denotes the number of repository categories. In Test Invokes, Entries per Test counts the functions invoked by a test case, while $\mathcal{P}_c$ Depth is the invocation distance between the test case and the modified code patch.
  • Figure 2: Depending on the input information available, the tests fall into two main scenarios: White Box testing and Black Box testing. Performance is evaluated with four metrics: $HP$, $F2P$, $EC$, and $CFG$.
  • Figure 3: The impact of the number of generated test cases on performance. The best performance of each model is highlighted.
  • Figure 4: Fail-to-Pass success rates across instance year buckets for six LLMs without dependency code access, highlighting each model’s peak performance season.
  • ...and 2 more figures