Enhancing Multilingual Capabilities of Large Language Models through Self-Distillation from Resource-Rich Languages

Yuanchi Zhang; Yile Wang; Zijun Liu; Shuo Wang; Xiaolong Wang; Peng Li; Maosong Sun; Yang Liu

TL;DR

Experimental results demonstrate that SDRRL can significantly enhance multilingual capabilities while minimizing the impact on original performance in resource-rich languages.

Abstract

While large language models (LLMs) have been pre-trained on multilingual corpora, their performance in most languages still lags behind that in a few resource-rich languages. One common approach to mitigate this issue is to translate training data from resource-rich languages into other languages and then continue training. However, relying solely on translated data while ignoring the original capabilities of LLMs across languages is not always effective, and we show that it limits the performance of cross-lingual knowledge transfer. In this work, we propose SDRRL, a method based on Self-Distillation from Resource-Rich Languages that effectively improves multilingual performance by leveraging the internal capabilities of LLMs on resource-rich languages. We evaluate on different LLMs (LLaMA-2 and SeaLLM) and source languages across various comprehension and generation tasks. Experimental results demonstrate that SDRRL can significantly enhance multilingual capabilities while minimizing the impact on original performance in resource-rich languages.

Paper Structure (29 sections, 6 equations, 3 figures, 13 tables)

Introduction
Related Work
Method
SFT and Translate-then-SFT Paradigm
Self-Distillation from Resource-Rich Languages (SDRRL)
Transfer Set Construction
Transfer Set Translation
Applying Code-Switching
Incorporating External Parallel Corpus
Training Objective
Experiments
Setup
Implementation Details
Baselines
Datasets
...and 14 more sections

Figures (3)

Figure 1: Comparison between vanilla supervised fine-tuning (SFT), translate-then-SFT, and our proposed method. Besides using the translated question-answer pairs in the target language (e.g., Japanese), SDRRL further leverages the answer $A^{\star}_{\rm EN}$ generated by the LLM in the resource-rich language (e.g., English) and collects self-distilled data (in green box) to help enhance its multilingual capabilities.
Figure 2: t-SNE visualizations of output representations by LLaMA-2 before and after applying SDRRL. The markers in red and blue represent semantically equivalent instructions in different languages.
Figure 3: The occurrence rate of off-target issues in various languages during the SDRRL process.
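
The data flow described in the Figure 1 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `TransferExample`, `build_transfer_set`, and the `generate` and `translate` callables are hypothetical names standing in for the LLM's own English generation and for a translation step. Keeping the English pair alongside the translated pair is also an assumption; the listed steps for code-switching and the external parallel corpus are not modeled here.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class TransferExample:
    """One question-answer pair tagged with its language (hypothetical type)."""
    question: str
    answer: str
    lang: str


def build_transfer_set(
    questions_en: List[str],
    generate: Callable[[str], str],        # assumed: the LLM answering in English
    translate: Callable[[str, str], str],  # assumed: translate(text, target_lang)
    target_lang: str,
) -> List[TransferExample]:
    """Collect self-distilled pairs from the model's own English answers."""
    examples: List[TransferExample] = []
    for q_en in questions_en:
        # A*_EN: the answer produced by the model itself in the
        # resource-rich language, rather than a reference answer.
        a_star_en = generate(q_en)
        examples.append(TransferExample(q_en, a_star_en, "en"))
        # Translate both the question and the self-generated answer
        # into the target language (e.g., Japanese).
        examples.append(
            TransferExample(
                translate(q_en, target_lang),
                translate(a_star_en, target_lang),
                target_lang,
            )
        )
    return examples
```

In practice, `generate` would wrap the model being fine-tuned and `translate` would wrap whatever translation system is available; both are left as plain callables so the sketch stays independent of any particular library.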
