Trans-Tokenization and Cross-lingual Vocabulary Transfers: Language Adaptation of LLMs for Low-Resource NLP

François Remy, Pieter Delobelle, Hayastan Avetisyan, Alfiya Khabibullina, Miryam de Lhoneux, Thomas Demeester

TL;DR

This work introduces trans-tokenization, a cross-lingual vocabulary transfer method that uses SMT-based token alignment to initialize target-language embeddings from a source-language model, enabling efficient adaptation of high-resource LLMs to low-resource languages. It also presents Hydra LLMs, architectures with multiple swappable embedding tables and language modeling heads that support zero-shot cross-lingual tasks and translation. The authors validate their approach with the Tweety trans-tokenized models for Tatar, Armenian, and Dutch, reporting competitive perplexity and competitive results on language understanding, summarization, and translation, including zero-shot MT for Tatar that approaches commercial systems when combined with finetuning. By releasing code, models, and a Tatar summarization dataset, the work aims to democratize language technology for underrepresented languages and encourage broader cross-lingual research and collaboration.

Abstract

The development of monolingual language models for low- and mid-resource languages continues to be hindered by the difficulty in sourcing high-quality training data. In this study, we present a novel cross-lingual vocabulary transfer strategy, trans-tokenization, designed to tackle this challenge and enable more efficient language adaptation. Our approach focuses on adapting a high-resource monolingual LLM to an unseen target language by initializing the token embeddings of the target language using a weighted average of semantically similar token embeddings from the source language. For this, we leverage a translation resource covering both the source and target languages. We validate our method with the Tweeties, a series of trans-tokenized LLMs, and demonstrate their competitive performance on various downstream tasks across a small but diverse set of languages. Additionally, we introduce Hydra LLMs, models with multiple swappable language modeling heads and embedding tables, which further extend the capabilities of our trans-tokenization strategy. By designing a Hydra LLM based on the multilingual model TowerInstruct, we developed a state-of-the-art machine translation model for Tatar, in a zero-shot manner, completely bypassing the need for high-quality parallel data. This breakthrough is particularly significant for low-resource languages like Tatar, where high-quality parallel data is hard to come by. By lowering the data and time requirements for training high-quality models, our trans-tokenization strategy allows for the development of LLMs for a wider range of languages, especially those with limited resources. We hope that our work will inspire further research and collaboration in the field of cross-lingual vocabulary transfer and contribute to the empowerment of languages on a global scale.
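
The core initialization step described in the abstract, a weighted average of source-language token embeddings, can be illustrated with a minimal sketch. This is not the authors' released code: the function name `init_target_embeddings`, the toy two-dimensional vectors, and the alignment counts are all hypothetical, and the mean-embedding fallback for unaligned tokens is an assumption made here for completeness. In the paper the weights come from SMT-based token alignment over a translation resource; here they are simply normalized made-up counts.

```python
from typing import Dict, List

Vector = List[float]


def init_target_embeddings(
    source_embeddings: Dict[str, Vector],
    alignment_counts: Dict[str, Dict[str, float]],
) -> Dict[str, Vector]:
    """Initialize each target token as a weighted average of aligned source tokens.

    alignment_counts[target_token][source_token] is a non-negative alignment
    score (hypothetical here; the paper derives it from SMT token alignment).
    """
    dim = len(next(iter(source_embeddings.values())))
    n = len(source_embeddings)
    # Assumed fallback for target tokens with no usable alignment:
    # the mean of all source embeddings.
    fallback = [
        sum(vec[i] for vec in source_embeddings.values()) / n for i in range(dim)
    ]

    target_embeddings: Dict[str, Vector] = {}
    for tgt_token, counts in alignment_counts.items():
        # Keep only aligned source tokens that exist in the source vocabulary.
        usable = {
            s: c for s, c in counts.items() if s in source_embeddings and c > 0
        }
        total = sum(usable.values())
        if total == 0:
            target_embeddings[tgt_token] = list(fallback)
            continue
        vec = [0.0] * dim
        for src_token, count in usable.items():
            weight = count / total  # normalize counts so the weights sum to 1
            for i in range(dim):
                vec[i] += weight * source_embeddings[src_token][i]
        target_embeddings[tgt_token] = vec
    return target_embeddings


# Toy English -> Dutch example with made-up vectors and counts.
source_embeddings = {
    "house": [1.0, 0.0],
    "home": [0.0, 1.0],
    "dog": [2.0, 2.0],
}
alignment_counts = {
    "huis": {"house": 3, "home": 1},  # mostly aligned with "house"
    "hond": {"dog": 5},               # single aligned source token
    "xyz": {},                        # no alignment: uses the fallback
}
target = init_target_embeddings(source_embeddings, alignment_counts)
print(target["huis"])  # weighted 0.75 / 0.25 between "house" and "home"
```

The same weighted-average idea would apply to the language modeling head when it is not tied to the embedding table; a real implementation would operate on the model's embedding matrices rather than on Python dictionaries.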

Paper Structure (34 sections, 5 figures, 16 tables)

This paper contains 34 sections, 5 figures, 16 tables.

Figures (5)

  • Figure 1: Overview of our Trans-Tokenization method
  • Figure 2: Illustration of the arbitrary nature of token alignment which can be captured by evidence-based SMT mappings (trans-tokenization) but not by character-based mappings.
  • Figure 3: Google Translate results (in orange) are not in line with the otherwise strong cross-task correlations of the other models. We estimate a real score of about 53 instead.
  • Figure 4: We find that neither the English-to-Tatar nor the Russian-to-Tatar mapping performs better than the other for transfer learning through trans-tokenization. We attribute this to the fact that TowerInstruct was trained on corpora of equal size for English and Russian, and that neither language is particularly close to Tatar. However, combining both initializations provides some benefit.
  • Figure 5: Cosine Similarity Analysis of English-initialized and Russian-initialized embeddings of Tatar tokens, revealing only a very limited degree of similarity.