Enhancing Translation Accuracy of Large Language Models through Continual Pre-Training on Parallel Data

Minato Kondo; Takehito Utsuro; Masaaki Nagata

TL;DR

This paper addresses machine translation with relatively small LLMs by introducing a two-phase training pipeline: continual pre-training (CP) on parallel data, followed by supervised fine-tuning (SFT) on a small, high-quality parallel corpus. It compares eight parallel-data formats for Japanese-to-English and English-to-Japanese translation using a 3.8B-parameter model, and shows that formats which alternate source and target sentences, especially interleaved data with tags on the source sentences, yield higher translation accuracy than formats that do not alternate them. The benefit of CP is direction-dependent: accuracy improves only for the translation direction whose source-target order matches the CP data. The LLM-based translation model is also more robust on spoken language and reaches higher accuracy with less training data than supervised encoder-decoder baselines. The combination of CP and SFT is evaluated on 13 test sets against a Transformer baseline, suggesting a practical route to deploying translation systems with resource-constrained LLMs. The results encourage exploring LoRA and other parameter-efficient tuning methods on larger LLMs and additional language pairs, while acknowledging limitations in scope and data quality.

Abstract

In this paper, we propose a two-phase training approach in which pre-trained large language models are continually pre-trained on parallel data and then supervised fine-tuned with a small amount of high-quality parallel data. To investigate the effectiveness of the approach, we conduct continual pre-training with a 3.8B-parameter model on parallel data in eight different formats and evaluate the resulting models on thirteen test sets for Japanese-to-English and English-to-Japanese translation. The results demonstrate that, when parallel data is used in continual pre-training, it is essential to alternate between source and target sentences. Translation accuracy improves only for translation directions in which the order of source and target sentences in the continual pre-training data matches the order at inference. We also demonstrate that the LLM-based translation model is more robust in translating spoken language and achieves higher accuracy with less training data than supervised encoder-decoder models. Finally, the highest accuracy is achieved when the continual pre-training data consists of interleaved source and target sentences with tags added to the source sentences.
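The interleaving idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' data pipeline: the tag strings in `SOURCE_TAGS`, the newline separator, and the helper names `interleave` and `swap_direction` are all assumptions made for illustration, since this summary does not specify the paper's exact formats.

```python
from typing import List, Tuple

# Hypothetical tag strings: this summary does not give the paper's actual tags.
SOURCE_TAGS = {"en": "<en>", "ja": "<ja>"}

Pair = Tuple[str, str]  # (source sentence, target sentence)


def swap_direction(pairs: List[Pair]) -> List[Pair]:
    """Turn (source, target) pairs into (target, source) pairs for the opposite direction."""
    return [(tgt, src) for src, tgt in pairs]


def interleave(pairs: List[Pair], src_lang: str, tagged: bool = False) -> str:
    """Build one training text where each source line is followed by its translation."""
    lines: List[str] = []
    for src, tgt in pairs:
        if tagged:
            # Prefix only the source sentence with its (assumed) language tag.
            src = f"{SOURCE_TAGS[src_lang]} {src}"
        lines.extend([src, tgt])
    return "\n".join(lines)


en_ja = [("Hello.", "こんにちは。"), ("Thank you.", "ありがとう。")]
en_to_ja_text = interleave(en_ja, "en", tagged=True)
ja_to_en_text = interleave(swap_direction(en_ja), "ja", tagged=True)
```

Because the abstract reports that accuracy improves only when the source-target order in the training data matches the order at inference, the sketch builds a separate text per direction rather than reusing one text for both.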

Paper Structure (45 sections, 2 equations, 3 figures, 10 tables)

Introduction
Related Work
Parallel Data in Pre-Training from Scratch
LLMs-Based Translation Models
Continual Pre-Training and Supervised Fine-Tuning with Parallel Data
Continual Pre-Training
Supervised Fine-Tuning
Experiments
Dataset
Continual Pre-Training
Supervised Fine-Tuning
Test Sets
Models
Baseline Models
Transformer
...and 30 more sections

Figures (3)

Figure 1: Radar chart of COMET scores. The blue line indicates the accuracy of the Transformer, and the red line indicates the accuracy of the model continually pre-trained with the Mix format and then supervised fine-tuned with full weights. Underlined test sets differ significantly from the Transformer ($p<0.05$).
Figure 2: Data curves for BLEU and COMET scores on WMT22 test data for Transformer, Direct-SFT, and Mix. Mix is evaluated after continual pre-training followed by supervised fine-tuning with LoRA. Because of computational resource constraints, we experimented with data amounts of 10%, 20%, 30%, 50%, and 100%. For the Transformer, we varied the proportion of data drawn from JParaCrawl v3.0. For Direct-SFT and Mix, training ran for only one epoch, so we treat each checkpoint's position in training as the proportion of training data used and report the accuracy at each checkpoint.
Figure 3: Data curves for BLEU and COMET scores at each 10% checkpoint of En-Ja2Mix for En $\Rightarrow$ Ja on WMT22 test data. All checkpoint models have undergone supervised fine-tuning. The 0% point on the x-axis represents the accuracy of En-Ja.
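
The Figure 2 caption equates a checkpoint's position in a one-epoch run with the share of training data seen. A minimal sketch of that bookkeeping follows; the function name `checkpoint_steps` and the `total_steps` value are hypothetical, and only the percentages come from the caption.

```python
from typing import Dict, Sequence


def checkpoint_steps(
    total_steps: int, percents: Sequence[int] = (10, 20, 30, 50, 100)
) -> Dict[int, int]:
    """Map each data percentage to the optimizer step reached after that share of one epoch."""
    # Integer arithmetic avoids floating-point rounding on the step index.
    return {p: total_steps * p // 100 for p in percents}


# Hypothetical run length of 1000 steps for a single epoch.
steps = checkpoint_steps(1000)
```

This mapping only holds for a single pass over the data; with more than one epoch, a checkpoint's step would no longer correspond to a unique fraction of the training set.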
