LEIA: Facilitating Cross-lingual Knowledge Transfer in Language Models with Entity-based Data Augmentation

Ikuya Yamada, Ryokan Ri

TL;DR

LEIA improves cross-lingual transfer by injecting English supervision into target-language training through entity-based data augmentation. It augments target-language Wikipedia text by inserting English entity names, enclosed in special <translate> and </translate> tokens, adjacent to hyperlinks, and then trains the model autoregressively on the augmented corpus. Evaluations on 7B LLMs (LLaMA 2 and Swallow) across diverse non-English QA tasks show consistent gains over strong baselines, indicating effective English-to-target-language knowledge transfer. The work highlights the viability of cross-lingual entity supervision for multilingual LLM adaptation and points to future extensions to other corpora and pretraining regimes.

Abstract

Adapting English-based large language models (LLMs) to other languages has become increasingly popular due to the efficiency and potential of cross-lingual transfer. However, existing language adaptation methods often overlook the benefits of cross-lingual supervision. In this study, we introduce LEIA, a language adaptation tuning method that utilizes Wikipedia entity names aligned across languages. This method involves augmenting the target language corpus with English entity names and training the model using left-to-right language modeling. We assess LEIA on diverse question answering datasets using 7B-parameter LLMs, demonstrating significant performance gains across various non-English languages. The source code is available at https://github.com/studio-ousia/leia.
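
The augmentation step described above can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that); it only shows the general idea under stated assumptions: hyperlinks are given as character spans, the inter-language mapping is a plain dictionary, and the English name is inserted immediately after the anchor text. The names `Hyperlink` and `augment_with_english_entities`, and the sample sentence, are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass

# Special tokens named in the paper summary.
TRANSLATE_OPEN = "<translate>"
TRANSLATE_CLOSE = "</translate>"


@dataclass(frozen=True)
class Hyperlink:
    """A hyperlink in target-language text (hypothetical structure)."""
    start: int          # character offset where the anchor text begins
    end: int            # character offset just past the anchor text
    target_title: str   # title of the linked target-language article


def augment_with_english_entities(text, links, english_names):
    """Insert <translate>English name</translate> after each resolvable link.

    Assumptions (not from the source): links do not overlap, and the English
    name goes to the right of the anchor text. Links whose title has no entry
    in `english_names` are left unchanged.
    """
    pieces = []
    cursor = 0
    for link in sorted(links, key=lambda item: item.start):
        # Copy everything up to and including the anchor text.
        pieces.append(text[cursor:link.end])
        cursor = link.end
        english = english_names.get(link.target_title)
        if english is not None:
            pieces.append(f"{TRANSLATE_OPEN}{english}{TRANSLATE_CLOSE}")
    # Copy the remainder of the text after the last link.
    pieces.append(text[cursor:])
    return "".join(pieces)


if __name__ == "__main__":
    # Illustrative sentence and mapping, not taken from the paper.
    sentence = "東京は日本の首都です。"
    sentence_links = [Hyperlink(0, 2, "東京"), Hyperlink(3, 5, "日本")]
    mapping = {"東京": "Tokyo", "日本": "Japan"}
    print(augment_with_english_entities(sentence, sentence_links, mapping))
```

The augmented text would then be used as ordinary training data for left-to-right language modeling, so no change to the model architecture is implied by this sketch.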

Paper Structure (17 sections, 6 figures, 12 tables)

This paper contains 17 sections, 6 figures, 12 tables.

Figures (6)

  • Figure 1: Data augmentation of LEIA applied to text from Chinese Wikipedia. English entity names, resolved through the inter-language links, enclosed in special <translate> and </translate> tokens are inserted adjacent to hyperlinks to facilitate cross-lingual transfer.
  • Figure 2: Prompt for X-CSQA.
  • Figure 3: Prompt for JEMHopQA in llm-jp-eval.
  • Figure 4: Prompt for NIILC in llm-jp-eval.
  • Figure 5: Prompt for JCommonsenseQA in JP Language Model Evaluation Harness.
  • ...and 1 more figure