Repair Is Nearly Generation: Multilingual Program Repair with LLMs

Harshit Joshi, José Cambronero, Sumit Gulwani, Vu Le, Ivan Radicek, Gust Verbruggen

TL;DR

RING demonstrates that a large language model trained on code (LLMC) can power multilingual program repair, achieving competitive results across Excel, Power Fx, Python, JavaScript, C, and PowerShell. RING decomposes repair into fault localization, code transformation via few-shot example selection, and ranking of candidate Codex outputs, and it relies on language tooling and error-message-based prompts to generalize across languages. The work provides evidence that LLMC-powered repair can match or exceed language-specific engines for several of these languages, and it introduces a new PowerShell benchmark. It also outlines practical guidance for building cross-language example banks and adapting the approach to new languages, and points to potential enhancements in ranking models and adaptive prompt strategies.

Abstract

Most programmers make mistakes when writing code. Some of these mistakes are small and require few edits to the original program -- a class of errors recently termed last mile mistakes. These errors break the flow for experienced developers and can stump novice programmers. Existing automated repair techniques targeting this class of errors are language-specific and do not easily carry over to new languages. Transferring symbolic approaches requires substantial engineering and neural approaches require data and retraining. We introduce RING, a multilingual repair engine powered by a large language model trained on code (LLMC) such as Codex. Such a multilingual engine enables a flipped model for programming assistance, one where the programmer writes code and the AI assistance suggests fixes, compared to traditional code suggestion technology. Taking inspiration from the way programmers manually fix bugs, we show that a prompt-based strategy that conceptualizes repair as localization, transformation, and candidate ranking, can successfully repair programs in multiple languages with minimal effort. We present the first results for such a multilingual repair engine by evaluating on 6 different languages and comparing performance to language-specific repair engines. We show that RING can outperform language-specific repair engines for three of these languages.

Paper Structure

This paper contains 40 sections, 11 figures, and 9 tables.

Figures (11)

  • Figure 1: RING, powered by a large language model trained on code (LLMC), performs multilingual program repair. RING obtains fault localization information from error messages and leverages the LLMC's few-shot capabilities for code transformation through example selection, forming the prompt. Finally, a simple yet effective technique is used for ranking repair candidates.
  • Figure 2: A real Python 3 syntax error from the BIFI dataset. The highlighted code uses tuple parameter unpacking syntax, which was valid in Python 2 but removed from Python 3. All listings are simplified for presentation clarity and brevity.
  • Figure 3: To aid fault localization, we include a detailed compiler error message with line/column span information. We prepare uniform messages across languages by extracting details from the corresponding language compiler/analyzer.
  • Figure 4: Our smart selection of few-shots retrieves relevant buggy-fix examples from an example bank. Shots are retrieved based on a similarity metric over error diagnostics. The shot selected (pink background) displays the same invalid signature-level tuple parameter unpacking (dark red background, bold) as our target program. The fixed portion of the shot (green background, bold) removes the parentheses.
  • Figure 5: We consider separately the programs not repaired at pass@1 by RING and by the language-specific baselines. We compute an approximate error localization metric, which marks as correctly localized any edit that is within $k$ tokens of the ground-truth edit location. When RING fails to repair a program, it correctly localizes a larger fraction of programs than the language-specific baselines do.
  • ...and 6 more figures
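
The few-shot selection described in the captions for Figures 3 and 4 can be illustrated with a minimal sketch. It is not the paper's implementation. The `Shot` class, the helper names, the prompt layout, the diagnostic strings, and the use of `difflib.SequenceMatcher` as the similarity metric are all assumptions made for illustration. The example bank reuses the tuple parameter unpacking error from Figure 2.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Shot:
    """Hypothetical example-bank entry: buggy program, its diagnostic, and its fix."""
    buggy: str
    diagnostic: str
    fixed: str


def diagnostic_similarity(a: str, b: str) -> float:
    # Stand-in similarity over error diagnostics (an assumption;
    # the source does not specify the metric in this section).
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def select_shots(bank, target_diagnostic, n=1):
    # Keep the n bank entries whose diagnostics are most similar to the target's.
    ranked = sorted(
        bank,
        key=lambda s: diagnostic_similarity(s.diagnostic, target_diagnostic),
        reverse=True,
    )
    return ranked[:n]


def build_prompt(shots, buggy, diagnostic):
    # Hypothetical prompt layout: each shot shows buggy code, error, and fix;
    # the target program ends with an open "Fixed" slot for the model to complete.
    parts = [
        f"### Buggy\n{s.buggy}\n### Error\n{s.diagnostic}\n### Fixed\n{s.fixed}\n"
        for s in shots
    ]
    parts.append(f"### Buggy\n{buggy}\n### Error\n{diagnostic}\n### Fixed\n")
    return "\n".join(parts)


# Illustrative bank; the diagnostic strings are made up, not real compiler output.
BANK = [
    Shot(
        buggy="def area((w, h)):\n    return w * h",
        diagnostic="SyntaxError (line 1, col 10): tuple parameter unpacking",
        fixed="def area(w, h):\n    return w * h",
    ),
    Shot(
        buggy="print 'hello'",
        diagnostic="SyntaxError (line 1, col 1): missing parentheses in call to print",
        fixed="print('hello')",
    ),
]

target_buggy = "def dist((x, y)):\n    return (x * x + y * y) ** 0.5"
target_diag = "SyntaxError (line 1, col 10): tuple parameter unpacking"

shots = select_shots(BANK, target_diag, n=1)
prompt = build_prompt(shots, target_buggy, target_diag)
```

With this bank, the selected shot is the one that exhibits the same signature-level unpacking error as the target, mirroring the situation in Figure 4. The resulting `prompt` string would then be sent to the LLMC, and the returned candidates ranked in a separate step.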