Continual Pre-Training for Cross-Lingual LLM Adaptation: Enhancing Japanese Language Capabilities

Kazuki Fujii, Taishi Nakamura, Mengsay Loem, Hiroki Iida, Masanari Ohi, Kakeru Hattori, Hirai Shota, Sakae Mizuki, Rio Yokota, Naoaki Okazaki

TL;DR

This work investigates cross-lingual continual pre-training as a way to improve the Japanese capabilities of English-pretrained LLMs. The authors build Swallow by extending the Llama 2 vocabulary with Japanese characters and continuing pre-training on a large Japanese web corpus. They analyze data scale, model size, vocabulary expansion, and parallel corpora, and show that continual pre-training yields strong gains on Japanese tasks, with performance increasing monotonically up to 100B training tokens and the largest benefits on question answering. Vocabulary expansion improves efficiency without hurting performance, except on summarization, while adding parallel corpora improves translation quality without harming other tasks. The results offer practical guidance for efficient cross-lingual adaptation of LLMs, and Swallow outperforms other LLMs trained from scratch on English and Japanese.

Abstract

Cross-lingual continual pre-training of large language models (LLMs) initially trained on English corpora allows us to leverage the vast amount of English language resources and reduce the pre-training cost. In this study, we constructed Swallow, an LLM with enhanced Japanese capability, by extending the vocabulary of Llama 2 to include Japanese characters and conducting continual pre-training on a large Japanese web corpus. Experimental results confirmed that performance on Japanese tasks improved drastically through continual pre-training, and that performance increased monotonically with the amount of training data up to 100B tokens. Consequently, Swallow achieved superior performance compared to other LLMs that were trained from scratch in English and Japanese. An analysis of the effects of continual pre-training revealed that it was particularly effective for Japanese question answering tasks. Furthermore, to elucidate effective methodologies for cross-lingual continual pre-training from English to Japanese, we investigated the impact of vocabulary expansion and the effectiveness of incorporating parallel corpora. The results showed that the efficiency gained through vocabulary expansion had no negative impact on performance, except for the summarization task, and that the combined use of parallel corpora enhanced translation ability.
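
To make the efficiency argument for vocabulary expansion concrete, here is a minimal sketch of a toy tokenizer. It is not the Llama 2 or Swallow tokenizer: it assumes a character-level vocabulary with UTF-8 byte fallback, and the names `count_tokens`, `base_vocab`, and `expanded_vocab` are hypothetical. It only illustrates why adding Japanese characters to the vocabulary can shorten the token sequence for the same text.

```python
def count_tokens(text, vocab):
    """Count tokens under a toy character-level tokenizer with byte fallback.

    Assumption: a character found in `vocab` costs one token; any other
    character falls back to one token per UTF-8 byte.
    """
    total = 0
    for ch in text:
        if ch in vocab:
            total += 1
        else:
            total += len(ch.encode("utf-8"))
    return total


# Hypothetical vocabularies, for illustration only.
base_vocab = set("abcdefghijklmnopqrstuvwxyz ")
expanded_vocab = base_vocab | set("日本語")

text = "日本語"
tokens_before = count_tokens(text, base_vocab)      # every character falls back to bytes
tokens_after = count_tokens(text, expanded_vocab)   # every character is a single token

print(tokens_before, tokens_after)  # prints: 9 3
```

In this toy setting the same string needs a third as many tokens after expansion, so a fixed context window or training budget covers more Japanese text. The actual savings for Swallow depend on its real tokenizer and are not shown here.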

Paper Structure (46 sections, 11 figures, 10 tables)

Figures (11)

  • Figure 1: Relative change in performance of Swallow compared to $\mathtt{Llama\ 2}$. Japanese tasks (left, see Table \ref{tab:eval_benchmark_ja} for task details) improved by up to approximately 70%.
  • Figure 2: Joint distribution of $\mathtt{Llama\ 2}$ (x-axis) and Swallow (y-axis) scores (character F1, with 1.0 representing an exact match) for NIILC questions.
  • Figure 3: Scalability of continual pre-training on Japanese tasks. Score at 0B tokens corresponds to the baseline performance of the $\mathtt{Llama\ 2}$ model.
  • Figure 4: Relative change in performance with versus without vocabulary expansion (Swallow vs. Swallow$\neg \mathtt{VE}$).
  • Figure 5: Relative change in performance when using parallel corpus compared to Swallow$\neg \mathtt{VE}$.
  • ...and 6 more figures
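
Several of the captions above report a relative change in performance. The sketch below assumes the usual definition, (new score minus baseline score) divided by the baseline score; the paper's exact formula is not stated in this section, and the function name and the scores used are hypothetical, not results from the paper.

```python
def relative_change(baseline, adapted):
    """Relative change of `adapted` with respect to `baseline`.

    Assumption: relative change = (adapted - baseline) / baseline.
    """
    if baseline == 0:
        raise ValueError("relative change is undefined for a zero baseline")
    return (adapted - baseline) / baseline


# Hypothetical scores for illustration only (not taken from the paper).
baseline_score = 0.50   # e.g. the base model on some task
adapted_score = 0.60    # e.g. the continually pre-trained model on the same task

change = relative_change(baseline_score, adapted_score)
print(f"{change:+.0%}")  # prints: +20%
```

A relative measure makes tasks with different score ranges comparable in one plot, but it exaggerates changes on tasks where the baseline score is close to zero.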