Unraveling the Potential of Large Language Models in Code Translation: How Far Are We?

Qingxiao Tao, Tingrui Yu, Xiaodong Gu, Beijun Shen

TL;DR

A large-scale empirical study that explores the capabilities and limitations of LLMs in code translation and proposes two methods: intermediary translation, which selects an intermediary language between the source and target ones, and self-training, which fine-tunes LLMs on self-generated parallel data.

Abstract

While large language models (LLMs) exhibit state-of-the-art performance in various tasks, recent studies have revealed that they struggle with code translation. This is because they have not been extensively pre-trained on parallel multilingual code, which code translation heavily depends on. Moreover, existing benchmarks cover only a limited subset of common programming languages and thus cannot reflect the full potential of LLMs in code translation. In this paper, we conduct a large-scale empirical study to explore the capabilities and limitations of LLMs in code translation tasks. We first craft a novel benchmark called PolyHumanEval by extending HumanEval to a multilingual benchmark of 14 languages. With PolyHumanEval, we then perform over 110,000 translations with bleeding-edge code LLMs. The results show LLMs' suboptimal performance on translations from Python to other languages and the negligible impact of widely adopted LLM optimization techniques, such as conventional pre-training and instruction tuning, on code translation. To further uncover the potential of LLMs in code translation, we propose two methods: (1) intermediary translation, which selects an intermediary language between the source and target ones; and (2) self-training, which fine-tunes LLMs on self-generated parallel data. Evaluated with CodeLlama-13B, our approach yields an average improvement of 11.7% in computation accuracy on Python-to-other translations. Notably, we find that Go can serve as a lingua franca for translating between any two studied languages.
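
The two-hop idea behind intermediary translation can be illustrated with a minimal sketch. The abstract does not say how the intermediary language is selected or how the model is prompted, so everything below is an assumption for illustration: `translate_via`, the `Translator` signature, and the stand-in `echo_translator` are hypothetical names, not the authors' implementation.

```python
from typing import Callable, Optional

# Hypothetical signature: (code, source_lang, target_lang) -> translated code.
# In practice this would wrap a call to a code LLM.
Translator = Callable[[str, str, str], str]


def translate_via(
    code: str,
    source: str,
    target: str,
    translate: Translator,
    intermediary: Optional[str] = None,
) -> str:
    """Translate directly, or in two hops through an intermediary language."""
    # Fall back to a direct translation when no useful intermediary is given.
    if intermediary is None or intermediary in (source, target):
        return translate(code, source, target)
    # Hop 1: source -> intermediary. Hop 2: intermediary -> target.
    pivot_code = translate(code, source, intermediary)
    return translate(pivot_code, intermediary, target)


def echo_translator(code: str, source: str, target: str) -> str:
    # Stand-in for an LLM call: it only records which hop was taken.
    return f"{code} |{source}->{target}"


if __name__ == "__main__":
    print(translate_via("print(1)", "Python", "Java", echo_translator, intermediary="Go"))
```

With the stand-in translator, the call in the `__main__` block records both hops (Python to Go, then Go to Java), which mirrors the abstract's observation that Go may work as a pivot language.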

Paper Structure

This paper contains 29 sections, 4 figures, 11 tables.

Figures (4)

  • Figure 1: An Illustration of the Four Prompt Designs in Our Experiments.
  • Figure 2: An Illustration of Intermediary Translation.
  • Figure 3: An Illustration of Self-training.
  • Figure 4: Examples of Python$\rightarrow$Java Translations.