Knowledge Transfer from High-Resource to Low-Resource Programming Languages for Code LLMs

Federico Cassano, John Gouwar, Francesca Lucchetti, Claire Schlesinger, Anders Freeman, Carolyn Jane Anderson, Molly Q Feldman, Michael Greenberg, Abhinav Jangda, Arjun Guha

TL;DR

Code LLM performance varies widely across programming languages, largely because some languages have far less training data available than others. The authors present MultiPL-T, a semi-synthetic pipeline that turns high-resource language data (Python) into training data for low-resource languages (Julia, Lua, OCaml, R, Racket). It generates unit tests in the high-resource language, translates the code, compiles the tests into the target language, and keeps only the translations that pass those tests. Models fine-tuned on this data outperform other open Code LLMs on the MultiPL-E benchmark across multiple model families and sizes, and ablations and qualitative evaluations support the approach's data efficiency, cross-model transfer benefits, and robustness. The work provides open datasets and models, enabling reproducibility and application to additional languages, and argues for broader benchmarks and future integration with self-instruction methods. Overall, MultiPL-T offers a scalable, efficient way to improve Code LLM capabilities for low-resource languages.

Abstract

Over the past few years, Large Language Models of Code (Code LLMs) have started to have a significant impact on programming practice. Code LLMs are also emerging as building blocks for research in programming languages and software engineering. However, Code LLMs produce impressive results on programming languages that are well represented in their training data (e.g., Java, Python, or JavaScript), but struggle with low-resource languages that have limited training data available. Low-resource languages include OCaml, Racket, and several others. This paper presents an effective approach for boosting the performance of Code LLMs on low-resource languages using semi-synthetic data. Our approach, MultiPL-T, translates training data from high-resource languages into training data for low-resource languages in two steps. 1) We use a Code LLM to synthesize tests for commented code from a high-resource language, filtering out faulty tests and code with low test coverage. 2) We use a Code LLM to translate Python code to a target low-resource language, and use the tests to validate the translation. We apply this approach to generate tens of thousands of validated training items for Julia, Lua, OCaml, R, and Racket. Furthermore, we use an open model (StarCoderBase) with open training data (The Stack), which allows us to decontaminate benchmarks, train models without violating licenses, and run experiments that could not otherwise be done. With MultiPL-T generated data, we present fine-tuned versions of StarCoderBase and Code Llama for Julia, Lua, OCaml, R, and Racket. On established benchmarks (MultiPL-E), these models outperform other open Code LLMs. The MultiPL-T approach is easy to apply to new languages, and is significantly more efficient and effective than alternatives such as training longer.
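
The second step described in the abstract, translating and then validating with tests, can be sketched as a simple filter loop. This is a minimal illustration and not the authors' implementation: `SourceItem`, `TrainingItem`, `filter_translations`, and the three callables (`translate`, `compile_tests`, `passes`) are hypothetical names standing in for the Code LLM, the test compiler, and the target-language test runner. Keeping every passing candidate, rather than only one per source function, is also an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class SourceItem:
    """A high-resource (e.g. Python) function plus tests that already passed on it."""
    code: str
    tests: List[str]


@dataclass
class TrainingItem:
    """A validated pair: original source and a translation that passed the tests."""
    source: str
    translation: str


def filter_translations(
    items: Iterable[SourceItem],
    translate: Callable[[str], List[str]],       # hypothetical: LLM returns candidate translations
    compile_tests: Callable[[List[str]], str],   # hypothetical: compiles tests to the target language
    passes: Callable[[str, str], bool],          # hypothetical: runs candidate against compiled tests
) -> List[TrainingItem]:
    """Keep only candidate translations that pass the compiled tests."""
    kept: List[TrainingItem] = []
    for item in items:
        # Tests are compiled once per source function, then reused for every candidate.
        target_tests = compile_tests(item.tests)
        for candidate in translate(item.code):
            if passes(candidate, target_tests):
                kept.append(TrainingItem(source=item.code, translation=candidate))
    return kept
```

Passing the three stages in as callables keeps the sketch independent of any particular model or target language; in practice each would wrap a model call or a sandboxed process.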

Paper Structure (59 sections, 12 figures, 9 tables, 1 algorithm)

Figures (12)

  • Figure 1: The performance of StarCoderBase-15B on several languages supported by the MultiPL-E benchmark for Code LLMs, plotted against their proportion of the model's training data. Using MultiPL-T, this paper significantly improves how StarCoderBase-15B performs on several low-resource languages, as shown by the arrows. The bottom of each arrow indicates how the base model performs, and the arrowheads indicate performance after fine-tuning with MultiPL-T. We also show significant improvement on other LLMs (see the evaluation section).
  • Figure 2: A high-level overview of how MultiPL-T produces high-quality training data for a low-resource programming language. We use a Code LLM to translate a function from a high-resource language (①) to a low-resource language (②). The translated code is likely to be wrong, since LLMs perform poorly on low-resource languages. However, we filter out bad translations as follows. First, we generate unit tests for the original code (③). We execute these tests to ensure they succeed and also check for test coverage. Second, we compile these tests to the low-resource language (④). Finally, we filter the low-resource code (②) using the translated tests (④), keeping only the translations that pass.
  • Figure 3: An example prompt from a HumanEval problem and its translation to OCaml, with our extension to MultiPL-E. Not shown are doctests and hidden test cases, which are also translated to OCaml. This particular problem is hard for many LLMs because it alters the strong prior on what vowels are, by saying that y is a vowel when it is the last letter in a word.
  • Figure 4: We fine-tune StarCoderBase-1B on several epochs of language-specific data from The Stack and measure performance with MultiPL-E. In the full-data setting, we train on all data from The Stack for each language. These datasets vary in size (the labels measure their size in tokens). In the subset setting, we sample each dataset to be approximately the same size as the MultiPL-T datasets. Both approaches that use data from The Stack barely improve performance, and can even hurt performance. In contrast, fine-tuning on MultiPL-T (dashed lines) shows significant improvement.
  • Figure 5: Faulty Racket code generated by StarCoderBase-15B when seeded with five hand-written examples.
  • ...and 7 more figures