Assessing the Latent Automated Program Repair Capabilities of Large Language Models using Round-Trip Translation

Fernando Vallecillos Ruiz, Anastasiia Grishina, Max Hort, Leon Moonen

TL;DR

This work investigates round-trip translation (RTT) as a way to tap a latent capability of large language models (LLMs) for automated program repair (APR), evaluating nine LLMs on four Java benchmarks without fine-tuning. RTT translates buggy code to an intermediate representation (another programming language, PL, or English as natural language, NL) and back, seeking patches that pass the test suite. With GPT-4 and English as the intermediate, RTT generates plausible patches for 100 of the 164 bugs in HumanEval-Java, 97 of which were judged correct in manual assessment, and it produces plausible patches for 46 bugs missed by LLMs fine-tuned for APR. Its overall bug-fix rate nevertheless remains below that of state-of-the-art APR methods. Larger models and NL intermediates improve plausibility and compilability, while PL intermediates often underperform unless the intermediate is sufficiently distinct from the source. The study highlights RTT's potential as a complementary component in ensemble APR frameworks, notes its limitations in preserving code style and achieving high test-pass rates, and provides a replication package for reproducibility. Overall, RTT expands the toolbox for code repair by leveraging latent corrective signals in LLMs and points to further research on model variety, intermediate representations, and multi-agent APR setups.

Abstract

Research shows that errors in natural language can be corrected by translating texts to another language and back using language models. We explore to what extent this latent correction capability extends to Automated Program Repair (APR) by investigating Round-Trip Translation (RTT): translating code from one programming language into another programming or natural language and back, using Large Language Models (LLMs). We hypothesize that RTT restores patterns most commonly seen in the LLM's training corpora through regression toward the mean, replacing infrequent bugs with more frequent, natural, bug-free code. To test this hypothesis, we employ nine LLMs and four common APR benchmarks in Java, and perform a detailed quantitative and qualitative analysis of RTT-generated patches. We find that RTT through English generates plausible patches for 100 of 164 bugs with GPT-4 on the HumanEval-Java benchmark, and 97 are found to be correct in our manual assessment. Moreover, RTT uniquely generates plausible patches for 46 bugs that were missed by LLMs specifically fine-tuned for APR. While this demonstrates the viability of RTT for APR, we also observe limitations, such as a lower overall bug-fix rate than the state of the art and a dilution of the original coding style. We analyze the impact of these limitations and discuss the potential of using RTT as a complementary component in APR frameworks. A replication package is available for download from https://doi.org/10.5281/zenodo.10500593.

Keywords: automated program repair, large language model, machine translation
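
The round-trip loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function `round_trip_repair`, the `translate` and `passes_tests` callbacks, the `attempts` parameter, and the stub translator are all hypothetical names introduced here for illustration. In practice `translate` would be a call to an LLM, and `passes_tests` would compile the candidate and run the benchmark's test suite.

```python
from typing import Callable, Optional

# Hypothetical signature: translate(text, source_language, target_language) -> text
Translator = Callable[[str, str, str], str]


def round_trip_repair(
    buggy_code: str,
    translate: Translator,
    passes_tests: Callable[[str], bool],
    source: str = "Java",
    intermediate: str = "English",
    attempts: int = 5,  # assumed sampling budget, not a value from the paper
) -> Optional[str]:
    """Return the first round-trip candidate that passes the tests, else None."""
    for _ in range(attempts):
        # Forward leg: source language -> intermediate representation.
        middle = translate(buggy_code, source, intermediate)
        # Backward leg: intermediate representation -> source language.
        candidate = translate(middle, intermediate, source)
        # A candidate that passes the test suite is a "plausible" patch;
        # whether it is "correct" still requires manual assessment.
        if passes_tests(candidate):
            return candidate
    return None


# Toy stand-ins so the sketch runs without an LLM or a Java toolchain.
BUGGY = "int add(int a, int b) { return a - b; }"
FIXED = "int add(int a, int b) { return a + b; }"


def stub_translate(text: str, source: str, target: str) -> str:
    # Stand-in for an LLM: the English leg states the intent,
    # and the Java leg emits the common idiom for that intent.
    if target == "English":
        return "Return the sum of a and b."
    return FIXED


def stub_tests(candidate: str) -> bool:
    # Stand-in for compiling the candidate and running its test suite.
    return "a + b" in candidate


patch = round_trip_repair(BUGGY, stub_translate, stub_tests)
```

Keeping the translator and the test oracle as callbacks means the same loop covers both intermediate types mentioned above: passing another programming language as `intermediate` gives the PL variant, and passing English gives the NL variant.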

Paper Structure

This paper contains 39 sections, 5 equations, 15 figures, 7 tables.

Figures (15)

  • Figure 1: High-level overview of the RTT process with concrete examples taken from our empirical evaluation. The red highlight on the left indicates the buggy line, the green highlight on the right is the repaired line.
  • Figure 1: Problem: FACTORIZE (HumanEval-Java).
  • Figure 2: Post-processing step to overwrite scope and method name in the output of the LLM.
  • Figure 2: Problem: LARGEST_PRIME_FACTOR (HumanEval-Java).
  • Figure 3: Number of unique bugs fixed in various datasets by RTT through NL with a fixed language model. The largest number for each dataset is highlighted in bold.
  • ...and 10 more figures