Beyond Localization: Recoverable Headroom and Residual Frontier in Repository-Level RAG-APR

Pengtao Zhao, Boyang Yang, Bach Le, Feng Liu, Haoye Tian

Abstract

Repository-level automated program repair (APR) increasingly treats stronger localization as the main path to better repair. We ask a more targeted question: once localization is strengthened, which post-localization levers still provide recoverable gains, which are bounded within our protocol, and what residual frontier remains? We study this question on SWE-bench Lite with three representative repository-level RAG-APR paradigms, Agentless, KGCompass, and ExpeRepair. Our protocol combines Oracle Localization, within-pool Best-of-K, fixed-interface added context probes with per-condition same-token filler controls and same-repository hard negatives, and a common-wrapper oracle check. Oracle Localization improves all three systems, but Oracle success still stays below 50%. Extra candidate diversity still helps inside the sampled 10-patch pools, but that headroom saturates quickly. Under the two fixed interfaces, most informative added context conditions still outperform their own matched controls. The common-wrapper check shows different system responses: under a common wrapper, gains remain large for KGCompass and ExpeRepair, while Agentless changes more with builder choice. Prompt-level fusion still leaves a large residual frontier: the best fixed probe adds only 6 solved instances beyond the native three-system Solved@10 union. Overall, stronger localization, bounded search, evidence quality, and interface design all shape repository-level repair outcomes.
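The Solved@K and union bookkeeping mentioned in the abstract can be illustrated with a minimal sketch. It assumes, as one plausible reading rather than the paper's confirmed definition, that an instance counts as solved at K if at least one of the first K sampled patches in its pool resolves it, and that the "residual frontier" gain of a probe is the set of instances it solves outside the multi-system union. All names (`solved_at_k`, `union_solved`, `extra_beyond_union`) and the toy data are hypothetical; the paper's greedy upper bound and selector replay are not reproduced here.

```python
from typing import Dict, List, Set

# Hypothetical layout: instance id -> pass/fail flag for each sampled patch, in pool order.
Pool = Dict[str, List[bool]]


def solved_at_k(pool: Pool, k: int) -> Set[str]:
    """Instances where at least one of the first k patches resolves the issue (assumed definition)."""
    return {iid for iid, flags in pool.items() if any(flags[:k])}


def union_solved(pools: Dict[str, Pool], k: int) -> Set[str]:
    """Union of Solved@k sets across all systems."""
    solved: Set[str] = set()
    for pool in pools.values():
        solved |= solved_at_k(pool, k)
    return solved


def extra_beyond_union(probe_solved: Set[str], union: Set[str]) -> Set[str]:
    """Instances a probe solves that no system in the union solved."""
    return probe_solved - union


# Toy data only; these are not results from the paper.
toy_pools: Dict[str, Pool] = {
    "system_a": {"i1": [False, True, False], "i2": [False, False, False], "i3": [False, False, False]},
    "system_b": {"i1": [False, False, False], "i2": [False, False, True], "i3": [False, False, False]},
}
toy_union = union_solved(toy_pools, k=3)
toy_extra = extra_beyond_union({"i2", "i3"}, toy_union)
```

In this toy setup the union grows with K because later patches in a pool can rescue an instance, while the probe's extra contribution is counted only on instances that no native pool solved.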

Paper Structure

This paper contains 40 sections, 6 figures, 9 tables.

Figures (6)

  • Figure 1: High-level oracle intervention points in the three native pipelines. Oracle injects gold-patch-derived pre-patch file and line spans before each system builds its own final repair prompt.
  • Figure 2: Instance-level three-state outcome transition paths from Baseline to Oracle Localization and then to Best-of-$K$ ($K=10$). Each ribbon follows the same benchmark instances across stages, with green for resolved, orange for completed but unresolved, and gray for no patch submitted.
  • Figure 3: Greedy within-pool upper bounds and fixed-pool selector replay under Oracle Localization for the three native systems. The panels keep the four main references: greedy upper bound, random mean, original order, and cluster-diversity reranking. The red curve and band show the greedy upper bound with a bootstrap 95% interval, and the green band shows the range across fixed random seeds. Table \ref{tab:rq2-selector-k5-others} reports the remaining selector rows at $K{=}5$.
  • Figure 4: Solved-instance overlap across Agentless, KGCompass, and ExpeRepair under Baseline, Oracle Localization, and Best-of-$K$. Each Venn region label is an exact instance count.
  • Figure 5: Compact RQ4 frontier ladder on the shared 300-instance pool. Blue is the native three-system union, green is extra frontier recovered beyond that union, and gray is the remaining frontier.
  • ...and 1 more figure