Studying and Understanding the Effectiveness and Failures of Conversational LLM-Based Repair

Aolin Chen, Haojun Wu, Qi Xin, Steven P. Reiss, Jifeng Xuan

TL;DR

This paper examines why conversational LLM-based APR succeeds on some bugs and fails on others. The authors compare ChatRepair's cloze-style and full-function repair strategies (OneIter-SH vs. OneIter-M) and evaluate its iterative patch-improvement mechanism on Defects4J single-function bugs. They find that full-function repair reduces compilation errors and yields more correct patches than cloze-style repair, but long methods and external fix ingredients remain challenging; the iterative component adds little benefit and produces many duplicate patches. The work offers actionable directions for improving problem understanding, behavior inference, and fix-ingredient retrieval, with implications for designing more effective conversational APR systems and broader benchmarks.

Abstract

Automated program repair (APR) is designed to automate the process of bug-fixing. In recent years, thanks to the rapid development of large language models (LLMs), automated repair has achieved remarkable progress. Advanced APR techniques powered by conversational LLMs, most notably ChatGPT, have exhibited impressive repair abilities and gained increasing popularity because the underlying LLMs can provide repair feedback and perform iterative patch improvement. Despite these strengths, conversational APR techniques still fail to repair a large number of bugs. For example, ChatRepair, a state-of-the-art conversational technique, does not correctly repair over half of the single-function bugs in the Defects4J dataset. To understand the effectiveness and failures of conversational LLM-based repair and provide possible directions for improvement, we studied the exemplary ChatRepair with a focus on comparing the effectiveness of its cloze-style and full-function repair strategies, assessing its key iterative component for patch improvement, and analyzing the repair failures. Our study has led to a series of findings, which we believe provide key implications for future research.

Paper Structure

This paper contains 12 sections, 2 tables.