Investigating the Transferability of Code Repair for Low-Resource Programming Languages

Kyle Wong, Alfonso Amayuelas, Liangming Pan, William Yang Wang

TL;DR

This work investigates whether code repair distillation techniques that are effective for high-resource languages also carry over to low-resource languages, and shows that the benefits of distilling code repair ability are language-dependent.

Abstract

Large language models (LLMs) have shown remarkable performance on code generation tasks. A recent use case is iterative code repair, where an LLM fixes an incorrect program by reasoning about its errors and generating new code. Recent works augment the code repair process with techniques such as chain-of-thought reasoning or distillation, but study their benefits only on high-resource languages like Python and ignore low-resource languages like Perl. To address this gap in knowledge, we investigate the benefits of distilling code repair for both high- and low-resource languages to determine whether techniques that are effective in a high-resource setting are also applicable in a low-resource setting. Our evaluation shows that distilling the ability to repair code has language-dependent benefits. To explain this behavior, we perform a further analysis and find that, contrary to prior belief, the correlation between reasoning ability and code correction ability is weak. We hypothesize that this weak correlation is magnified in low-resource settings, where base models lack deep knowledge of a programming language, leading to inconsistent benefits from code repair.

Paper Structure (43 sections, 6 equations, 14 figures, 8 tables)

Figures (14)

  • Figure 1: A standard code repair framework. In (1) and (2), a code LLM is given a question and generates a solution. In (3), test cases are executed and an error message is extracted. In (4), a repair LLM is given the question, incorrect solution, and error message, and generates a repair. A repair contains a rationale explaining why the old code was incorrect and how to fix it, followed by new code. If the new code is still incorrect, we iteratively generate new repairs using the code from previous repairs. In (5), we stop when all tests pass or after a fixed number of iterations.
  • Figure 2: Our dataset construction pipeline. Examples in the fine-tuning dataset contain an instruction, the original question, the student's incorrect answer, the execution feedback, and the teacher's correct repair.
  • Figure 3: Mean pass@1 versus repair round for CodeLlama-7b-Instruct. Round 0 denotes the initial generation. Rationale-plus-code distillation outperforms rationale-only distillation on low-resource languages, but performs similarly on high-resource languages.
  • Figure 4: The prompt given to GPT-3.5-Turbo to generate the rationale portion of a repair. This is only used for the in-context learning baseline. <Q,A> is replaced with the question and previous answer, while <E> is replaced with the corresponding error.
  • Figure 5: The prompt for generating a repair. For brevity, we only show a one-shot example. <Q,A> is replaced with the question and previous answer, while <E> is replaced with the corresponding error.
  • ...and 9 more figures
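
The iterative repair framework described in the Figure 1 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Repair`, `repair_loop`, `generate`, `run_tests`, `repair`, and `max_rounds` are hypothetical, the two models are passed in as plain callables, and the toy stubs at the bottom stand in for real LLM calls and a real test harness.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Repair:
    """Hypothetical container for one repair: a rationale followed by new code."""
    rationale: str  # why the old code was incorrect and how to fix it
    code: str       # the new candidate program


def repair_loop(
    question: str,
    generate: Callable[[str], str],                 # code LLM (assumed interface)
    run_tests: Callable[[str], Optional[str]],      # returns an error message, or None if all tests pass
    repair: Callable[[str, str, str], Repair],      # repair LLM (assumed interface)
    max_rounds: int = 4,                            # assumed iteration budget
) -> Tuple[str, int, bool]:
    """Return (final code, number of repair rounds used, whether all tests passed)."""
    code = generate(question)                       # steps (1)-(2): initial solution, round 0
    for round_idx in range(max_rounds + 1):
        error = run_tests(code)                     # step (3): execute tests, extract error
        if error is None:
            return code, round_idx, True            # step (5): stop when all tests pass
        if round_idx == max_rounds:
            break                                   # step (5): stop after a fixed number of rounds
        # step (4): the repair model sees the question, the incorrect code, and the error
        code = repair(question, code, error).code
    return code, max_rounds, False


# Toy stand-ins for the models and the test harness (illustrative only).
def toy_generate(question: str) -> str:
    return "def add(a, b):\n    return a - b"       # deliberately wrong


def toy_run_tests(code: str) -> Optional[str]:
    namespace: dict = {}
    try:
        exec(code, namespace)                       # never do this with untrusted code
        assert namespace["add"](2, 3) == 5, "add(2, 3) should be 5"
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return None


def toy_repair(question: str, bad_code: str, error: str) -> Repair:
    return Repair(
        rationale="The function subtracts instead of adding.",
        code="def add(a, b):\n    return a + b",
    )
```

Calling `repair_loop("Write add(a, b).", toy_generate, toy_run_tests, toy_repair)` exercises one repair round with the stubs. Passing the models as callables keeps the loop independent of any particular LLM API, which is a design choice of this sketch rather than something the source specifies.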