Input Reduction Enhanced LLM-based Program Repair
Boyang Yang, Luyao Ren, Xin Yin, Jiadong Ren, Haoye Tian, Shunfu Jin
TL;DR
This paper tackles the problem of long failure-inducing test inputs degrading LLM-based program repair through the lost-in-the-middle effect. It introduces ReduceFix, a three-stage pipeline that (1) prompts an LLM to generate a task-specific input reducer, (2) reduces the failing input within a time bound, and (3) uses the compact input to guide patch generation, combined with truncation-aware prompting and multi-patch validation. On the new LFTBench long-input APR benchmark (200 bugs from AtCoder ABC tasks), ReduceFix shrinks inputs by 89.1% on average and improves pass@10 by up to 53.8%; adding the same reduction step to ChatRepair and CREF raises their fix rates by 21.3% and 2.6%, respectively, and the gains transfer to OSS-Fuzz cases. The results indicate that automatic, task-aware input reduction is a practical, drop-in enhancement for LLM-based APR with cross-language and real-world applicability. Overall, the work positions input reduction as a core component for improving the scalability and reliability of automated program repair.
Abstract
Large Language Models (LLMs) have shown great potential in Automated Program Repair (APR). Test inputs, being crucial for reasoning about the root cause of failures, are always included in the prompt for LLM-based APR. Unfortunately, LLMs struggle to retain key information in long prompts. When the prompt contains extensive test inputs, it may trigger the "lost-in-the-middle" issue and compromise repair performance. To address this, ReduceFix prompts an LLM to generate a reducer that minimizes failure-inducing test inputs without human effort, and then feeds the reduced failure-inducing inputs into the prompt to guide patch generation. For targeted evaluation, we constructed LFTBench, the first long-input APR benchmark, with 200 real bugs from 20 programming tasks, each paired with a failure-inducing input; the median input size is 1 MB. On this benchmark, ReduceFix shrinks inputs by 89.1% on average and improves overall pass@10 by up to 53.8% relative to a prompt that includes the original test, and by 17.6% relative to a prompt that omits the test entirely. Adding the same reduction step to ChatRepair and CREF increases their fix rates by 21.3% and 2.6%, respectively, without other changes. Our gains hold against a baseline that uses a ddmin-only reduction template, and they transfer to repository-level OSS-Fuzz cases. Ablation studies further highlight the impact of input length and compressed failure information on repair success. These results underscore that automatically reducing failing inputs is a practical and powerful complement to LLM-based APR, significantly improving its scalability and effectiveness.
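To make the idea of input reduction concrete, the following is a minimal sketch of a ddmin-style (delta debugging) reducer, the general technique behind the ddmin-only baseline mentioned above. It is an illustration under stated assumptions, not ReduceFix's LLM-generated reducer: the names `ddmin`, `still_fails`, and `toy_failure` are hypothetical, the input is assumed to be a list of tokens or lines, and the failure check is assumed to be deterministic.

```python
from typing import Callable, List, Sequence


def ddmin(items: Sequence[str],
          still_fails: Callable[[List[str]], bool]) -> List[str]:
    """Shrink `items` to a 1-minimal sublist that still triggers the failure.

    `still_fails` is a hypothetical oracle: in an APR setting it would run
    the buggy program on the candidate input and compare against a reference.
    """
    current = list(items)
    if not still_fails(current):
        raise ValueError("the original input must trigger the failure")

    n = 2  # granularity: target number of chunks to split into
    while len(current) >= 2:
        chunk = max(1, len(current) // n)
        subsets = [current[i:i + chunk]
                   for i in range(0, len(current), chunk)]
        reduced = False
        for idx, subset in enumerate(subsets):
            # Everything except the chunk at position idx.
            complement = [x for j, s in enumerate(subsets)
                          if j != idx for x in s]
            if still_fails(subset):
                # A single chunk is enough: restart at coarse granularity.
                current, n, reduced = subset, 2, True
                break
            if still_fails(complement):
                # The chunk was irrelevant: drop it and keep granularity.
                current, n, reduced = complement, max(n - 1, 2), True
                break
        if not reduced:
            if n >= len(current):
                break  # every single element was tested; nothing removable
            n = min(len(current), n * 2)  # refine granularity
    return current


def toy_failure(candidate: List[str]) -> bool:
    """Toy oracle (assumption): the bug needs both '17' and '83' present."""
    return "17" in candidate and "83" in candidate
```

For example, calling `ddmin([str(i) for i in range(1, 101)], toy_failure)` reduces a 100-token input to the two tokens that jointly trigger the toy failure. A real reducer would additionally need to keep the input well-formed for the task (for instance, updating a leading element count), which is the kind of task-specific knowledge a generic ddmin template lacks.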
