ERNIE-Code: Beyond English-Centric Cross-lingual Pretraining for Programming Languages

Yekun Chai, Shuohuan Wang, Chao Pang, Yu Sun, Hao Tian, Hua Wu

TL;DR

ERNIE-Code tackles English-centric limitations in code-oriented multilingual tasks by pretraining a unified NL-PL model over 116 NLs and 6 PLs. It introduces span-corruption language modeling (SCLM) and pivot-based translation language modeling (PTLM) within a T5-based encoder-decoder and a shared NL/PL encoding scheme, using both monolingual and parallel corpora. The model shows state-of-the-art performance across multilingual code-to-text, text-to-code, code-to-code, and text-to-text tasks, with notable zero-shot capabilities via prompting. The work also provides a multilingual NL-PL benchmark and discusses scaling and the curse of multilinguality as directions for future work.

Abstract

Software engineers working with the same programming language (PL) may speak different natural languages (NLs) and vice versa, erecting huge barriers to communication and working efficiency. Recent studies have demonstrated the effectiveness of generative pre-training in computer programs, yet they are always English-centric. In this work, we step towards bridging the gap between multilingual NLs and multilingual PLs for large language models (LLMs). We release ERNIE-Code, a unified pre-trained language model for 116 NLs and 6 PLs. We employ two methods for universal cross-lingual pre-training: span-corruption language modeling that learns patterns from monolingual NL or PL; and pivot-based translation language modeling that relies on parallel data of many NLs and PLs. Extensive results show that ERNIE-Code outperforms previous multilingual LLMs for PL or NL across a wide range of end tasks of code intelligence, including multilingual code-to-text, text-to-code, code-to-code, and text-to-text generation. We further show its advantage of zero-shot prompting on multilingual code summarization and text-to-text translation. We release our code and pre-trained checkpoints.
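
The span-corruption objective mentioned in the abstract can be illustrated with a minimal sketch. It assumes the usual T5-style format, in which each masked span is replaced by one sentinel token in the input and the target lists each sentinel followed by the tokens it hides. The function name `span_corrupt`, the `<extra_id_N>` sentinel naming, the whitespace tokenization, and the hand-picked spans are all assumptions made for illustration; they are not taken from the paper, where spans would be sampled at random over subword tokens.

```python
from typing import List, Sequence, Tuple


def span_corrupt(tokens: Sequence[str],
                 spans: Sequence[Tuple[int, int]]) -> Tuple[List[str], List[str]]:
    """Build a T5-style (input, target) pair.

    `spans` holds sorted, non-overlapping (start, length) pairs to mask.
    """
    inputs: List[str] = []
    targets: List[str] = []
    cursor = 0
    for sentinel_id, (start, length) in enumerate(spans):
        sentinel = f"<extra_id_{sentinel_id}>"  # assumed sentinel naming
        inputs.extend(tokens[cursor:start])     # keep the visible tokens
        inputs.append(sentinel)                 # one sentinel replaces the whole span
        targets.append(sentinel)                # target repeats the sentinel ...
        targets.extend(tokens[start:start + length])  # ... then the hidden tokens
        cursor = start + length
    inputs.extend(tokens[cursor:])              # visible tail after the last span
    return inputs, targets


# The same routine is applied to monolingual code and to monolingual text.
pl_tokens = "def add ( a , b ) : return a + b".split()
nl_tokens = "Returns the sum of two numbers".split()

pl_in, pl_out = span_corrupt(pl_tokens, [(1, 1), (9, 3)])
nl_in, nl_out = span_corrupt(nl_tokens, [(2, 2)])

print(" ".join(pl_in))
print(" ".join(pl_out))
print(" ".join(nl_in))
print(" ".join(nl_out))
```

Because the input is treated as a flat token sequence, nothing in the routine distinguishes a programming language from a natural language, which is what allows one objective to cover both kinds of monolingual data.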

Paper Structure

This paper contains 55 sections, 3 equations, 14 figures, 11 tables.

Figures (14)

  • Figure 1: Comparison among (a) Multilingual code pre-training; (b) Multilingual text pre-training; (c) Universal multilingual text-code pre-training (ours).
  • Figure 2: Schematic of the SCLM objective for PL (left) and NL (right) example.
  • Figure 3: Schematic of the PTLM objective for NL-to-PL (left), PL-to-NL (middle), NL-to-NL (right) example. "<SEP>" indicates the delimiter token.
  • Figure 4: Semantic and syntactic comparison on multilingual text-to-code generation. All comparison models are evaluated under "translate-train" settings by default, unless otherwise specified (i.e., "zero-shot").
  • Figure 5: Ablation test performance (log-scale). The reported results are averaged among all subtasks.
  • ...and 9 more figures
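
The input layout suggested by the Figure 3 caption can be sketched as follows. This page only states that PTLM uses parallel data and a "<SEP>" delimiter, so the rest is an assumption: the sketch joins a parallel pair as source, delimiter, target and masks spans on the target side only. The helper name `build_ptlm_pair`, the `<extra_id_N>` sentinels, the choice of which side is corrupted, and the toy sentences are hypothetical and may differ from the authors' recipe.

```python
from typing import List, Sequence, Tuple

SEP = "<SEP>"  # delimiter token named in the Figure 3 caption


def build_ptlm_pair(source: Sequence[str], target: Sequence[str],
                    target_spans: Sequence[Tuple[int, int]]) -> Tuple[List[str], List[str]]:
    """Return (encoder_input, decoder_target) for one parallel pair.

    Assumption: only the target segment is corrupted. `target_spans` holds
    sorted, non-overlapping (start, length) pairs indexed into `target`.
    """
    masked: List[str] = []
    labels: List[str] = []
    cursor = 0
    for sentinel_id, (start, length) in enumerate(target_spans):
        sentinel = f"<extra_id_{sentinel_id}>"  # hypothetical sentinel naming
        masked.extend(target[cursor:start])     # visible part of the target
        masked.append(sentinel)                 # sentinel stands in for the span
        labels.append(sentinel)
        labels.extend(target[start:start + length])
        cursor = start + length
    masked.extend(target[cursor:])
    # The source stays fully visible, so the model can translate from it.
    return list(source) + [SEP] + masked, labels


nl_en = "add two numbers".split()
nl_de = "addiere zwei Zahlen".split()  # toy German sentence for illustration
code = "return a + b".split()

# The three directions listed in the Figure 3 caption.
nl_to_pl = build_ptlm_pair(nl_en, code, [(0, len(code))])    # whole target hidden
pl_to_nl = build_ptlm_pair(code, nl_en, [(0, len(nl_en))])   # whole target hidden
nl_to_nl = build_ptlm_pair(nl_en, nl_de, [(1, 2)])           # partial target hidden

for enc, dec in (nl_to_pl, pl_to_nl, nl_to_nl):
    print(" ".join(enc), "=>", " ".join(dec))
```

Masking the entire target turns the example into a plain translation task, while masking part of it lets the model use both the source and the visible target context; which of these the paper uses is not stated in this summary.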