Agentic Bug Reproduction for Effective Automated Program Repair at Google

Runxiang Cheng, Michele Tufano, Jürgen Cito, José Cambronero, Pat Rondon, Renyao Wei, Aaron Sun, Satish Chandra

TL;DR

This paper addresses the scarcity of Bug Reproduction Tests (BRTs) in bug reports by studying automated BRT generation in Google’s industrial setting. It compares an adapted version of LIBRO with a novel BRT Agent and finds that the agent generates plausible BRTs for a substantially larger share of bugs (28% vs. 10%, on 80 human-reported bugs). Supplying the generated BRTs to Passerine, the Automated Program Repair (APR) system used in the study, results in 30% more bugs with plausible fixes. The paper also introduces Ensemble Pass Rate (EPR), a metric that uses the generated BRTs to select promising fixes from APR outputs: under Top-K selection, the top-ranked fix is plausible in 70% of cases when chosen from a pool of 20 candidates, and threshold-based selection offers favorable trade-offs between precision and recall. The work demonstrates practical value for industry by enabling better bug reproduction, faster repair validation, and improved bug-fix quality in a real-world, multi-language codebase, underscoring the potential of industrially tuned LLM agents for debugging workflows.

Abstract

Bug reports often lack sufficient detail for developers to reproduce and fix the underlying defects. Bug Reproduction Tests (BRTs), tests that fail when the bug is present and pass when it has been resolved, are crucial for debugging, but they are rarely included in bug reports, both in open-source and in industrial settings. Thus, automatically generating BRTs from bug reports has the potential to accelerate the debugging process and lower time to repair. This paper investigates automated BRT generation within an industry setting, specifically at Google, focusing on the challenges of a large-scale, proprietary codebase and considering real-world industry bugs extracted from Google's internal issue tracker. We adapt and evaluate a state-of-the-art BRT generation technique, LIBRO, and present our agent-based approach, BRT Agent, which makes use of a fine-tuned Large Language Model (LLM) for code editing. Our BRT Agent significantly outperforms LIBRO, achieving a 28% plausible BRT generation rate, compared to 10% by LIBRO, on 80 human-reported bugs from Google's internal issue tracker. We further investigate the practical value of generated BRTs by integrating them with an Automated Program Repair (APR) system at Google. Our results show that providing BRTs to the APR system results in 30% more bugs with plausible fixes. Additionally, we introduce Ensemble Pass Rate (EPR), a metric that leverages the generated BRTs to select the most promising fixes from all fixes generated by the APR system. Our evaluation of EPR for Top-K and threshold-based fix selection demonstrates promising results and trade-offs. For example, EPR correctly selects a plausible fix from a pool of 20 candidates in 70% of cases, based on its top-1 ranking.
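
To make the two selection modes mentioned in the abstract concrete, here is a minimal Python sketch. The abstract does not state the EPR formula, so the sketch assumes that a fix's EPR is the fraction of generated BRTs that pass once the fix is applied, which is only one reading of the name "Ensemble Pass Rate". The `CandidateFix` type, the function names, and the example data are all hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class CandidateFix:
    """Hypothetical record: one APR-generated fix and the BRT outcomes on it."""
    fix_id: str
    brt_passed: Tuple[bool, ...]  # one pass/fail entry per generated BRT


def ensemble_pass_rate(fix: CandidateFix) -> float:
    """ASSUMED definition: share of generated BRTs that pass with the fix applied."""
    if not fix.brt_passed:
        return 0.0  # no BRTs available, so there is no evidence for this fix
    return sum(fix.brt_passed) / len(fix.brt_passed)


def select_top_k(fixes: List[CandidateFix], k: int) -> List[CandidateFix]:
    """Top-K selection: keep the k fixes with the highest EPR.

    sorted() is stable, so fixes with equal EPR keep their input order.
    """
    return sorted(fixes, key=ensemble_pass_rate, reverse=True)[:k]


def select_by_threshold(fixes: List[CandidateFix], threshold: float) -> List[CandidateFix]:
    """Threshold selection: keep every fix whose EPR reaches the threshold."""
    return [fix for fix in fixes if ensemble_pass_rate(fix) >= threshold]


# Made-up example: three candidate fixes evaluated against four generated BRTs.
candidates = [
    CandidateFix("fix-a", (True, False, False, False)),  # EPR 0.25
    CandidateFix("fix-b", (True, True, True, False)),    # EPR 0.75
    CandidateFix("fix-c", (True, True, False, False)),   # EPR 0.50
]

top1 = select_top_k(candidates, k=1)
above_half = select_by_threshold(candidates, threshold=0.5)
print([fix.fix_id for fix in top1])        # ['fix-b']
print([fix.fix_id for fix in above_half])  # ['fix-b', 'fix-c']
```

Under this reading, Top-K always returns a fixed number of fixes, while a threshold may return none or many. Raising the threshold would tend to trade recall for precision, which is consistent with the trade-offs the abstract mentions.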

Paper Structure

This paper contains 59 sections, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Bug Reproduction Test (BRT) generation techniques explored in our work.
  • Figure 2: Action distribution by step for the BRT Agent.
  • Figure 3: The number of plausible fixes generated by Passerine with (red) and without (green) BRT as input.
  • Figure 4: Average number of steps per run for Passerine to generate a plausible fix with and without BRT as input.
  • Figure 5: Results of Top-K fix selection via EPR.
  • ...and 1 more figure