Optimizing LLMs for Italian: Reducing Token Fertility and Enhancing Efficiency Through Vocabulary Adaptation

Luca Moroni, Giovanni Puccetti, Pere-Lluis Huguet Cabot, Andrei Stefan Bejgu, Edoardo Barba, Alessio Miaschi, Felice Dell'Orletta, Andrea Esuli, Roberto Navigli

TL;DR

This work tackles the inefficiency of English-centric LLMs when adapting to Italian by focusing on tokenizer fertility and embedding misalignment. It introduces Semantic Alignment Vocabulary Adaptation (SAVA), a linear-mapping approach that leverages a helper Italian model to align embeddings and map new tokens, and compares it against Random, Fast Vocabulary Transfer (FVT), and CLP across Mistral-7B-v0.1 and Llama-3.1-8B. The results show substantial fertility reductions (about 25% for Mistral-7B-v0.1 and 16% for Llama-3.1-8B) and significant vocabulary and parameter reductions (up to 75% fewer tokens and 10% fewer parameters for Llama-3.1-8B), with SAVA often yielding faster convergence and strong downstream performance on Italian benchmarks and competitive English performance after continual training. The authors also analyze embedding structure to explain performance differences and discuss limitations, including dataset scope and translation-based evaluation, suggesting directions for extending vocabulary-adaptation methods to additional languages and helper-model choices.
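The link between the vocabulary reduction and the parameter reduction quoted above can be checked with a back-of-the-envelope calculation. The sketch below is not taken from the paper. It assumes the publicly documented Llama-3.1-8B configuration (128,256 vocabulary entries, hidden size 4,096, untied input and output embedding matrices) and an approximate total of 8.03 billion parameters. The helper `embedding_params` is a hypothetical name introduced only for this illustration.

```python
# Rough estimate of how many parameters are saved by shrinking the vocabulary.
# All constants below are assumptions based on the public Llama-3.1-8B
# configuration, not values reported in this summary.

VOCAB_SIZE = 128_256            # assumed original vocabulary size
HIDDEN_SIZE = 4_096             # assumed embedding dimension
TOTAL_PARAMS = 8_030_000_000    # assumed approximate total parameter count


def embedding_params(vocab_size: int, hidden_size: int, tied: bool = False) -> int:
    """Parameters held in the input embedding and the output projection.

    With untied embeddings there are two vocab_size x hidden_size matrices,
    with tied embeddings only one.
    """
    matrices = 1 if tied else 2
    return vocab_size * hidden_size * matrices


# "Up to 75% fewer tokens" means keeping one quarter of the vocabulary.
new_vocab_size = VOCAB_SIZE // 4

saved = embedding_params(VOCAB_SIZE, HIDDEN_SIZE) - embedding_params(new_vocab_size, HIDDEN_SIZE)
saved_fraction = saved / TOTAL_PARAMS

print(f"parameters saved: {saved:,}")
print(f"share of the full model: {saved_fraction:.1%}")
```

Under these assumptions the saving comes almost entirely from the two embedding matrices, and its share of the full model is in line with the roughly 10% figure given in the TL;DR.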

Abstract

The number of pretrained Large Language Models (LLMs) is increasing steadily, though the majority are designed predominantly for the English language. While state-of-the-art LLMs can handle other languages, due to language contamination or some degree of multilingual pretraining data, they are not optimized for non-English languages, leading to inefficient encoding (high token "fertility") and slower inference speed. In this work, we thoroughly compare a variety of vocabulary adaptation techniques for optimizing English LLMs for the Italian language, and put forward Semantic Alignment Vocabulary Adaptation (SAVA), a novel method that leverages neural mapping for vocabulary substitution. SAVA achieves competitive performance across multiple downstream tasks, enhancing grounded alignment strategies. We adapt two LLMs: Mistral-7B-v0.1, reducing token fertility by 25%, and Llama-3.1-8B, optimizing the vocabulary and reducing the number of parameters by 1 billion. We show that, following the adaptation of the vocabulary, these models can recover their performance with a relatively limited stage of continual training on the target language. Finally, we test the capabilities of the adapted models on various multi-choice and generative tasks.
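Token fertility is commonly understood as the average number of tokens a tokenizer produces per word. The following is a minimal sketch of that measurement, not the paper's evaluation code. It assumes whitespace-separated words and uses a toy greedy longest-match tokenizer. The function names `greedy_tokenize` and `fertility` and both toy vocabularies are hypothetical and exist only for illustration.

```python
# Toy illustration of token fertility (average tokens per word).
# The tokenizer and vocabularies are made up for this example; real
# tokenizers (e.g. BPE) split text differently.

def greedy_tokenize(word: str, vocab: set) -> list:
    """Split a word into the longest pieces found in vocab, left to right.

    A single character is always accepted as a fallback, so the loop
    is guaranteed to make progress.
    """
    tokens = []
    i = 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                tokens.append(word[i:j])
                i = j
                break
    return tokens


def fertility(text: str, vocab: set) -> float:
    """Average number of tokens per whitespace-separated word."""
    words = text.split()
    if not words:
        raise ValueError("text must contain at least one word")
    n_tokens = sum(len(greedy_tokenize(w, vocab)) for w in words)
    return n_tokens / len(words)


# A vocabulary with few Italian word-level entries fragments the text more
# than one built for Italian.
english_like_vocab = {"la", "ca", "sa", "be", "ll", "a"}
italian_vocab = {"la", "casa", "bella"}

sample = "la casa bella"
before = fertility(sample, english_like_vocab)
after = fertility(sample, italian_vocab)
relative_reduction = (before - after) / before

print(before, after, relative_reduction)
```

A lower fertility means fewer tokens for the same text, so the model runs fewer decoding steps and fits more content in its context window. The reductions reported in the abstract are relative changes of this kind, measured on real Italian text rather than on a toy sample.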

Paper Structure

This paper contains 32 sections, 8 equations, 10 figures, 11 tables.

Figures (10)

  • Figure 1: Fertility for two different tokenizers, Mistral-7B-v0.1 (left) and Minerva (right), over Italian texts from CulturaX (blue) and Wikipedia (red).
  • Figure 2: Average performance of Mistral-7B-v0.1 based models during training on Italian translated benchmarks. The average was calculated over six datasets.
  • Figure 3: Average performance of Llama-3.1-8B based models during training on Italian translated benchmarks. The average was calculated over six datasets.
  • Figure 4: Average performance of Mistral-7B-v0.1 based models during training on English benchmarks. The average was calculated over six datasets.
  • Figure 5: Average performance of Llama-3.1-8B based models during training on English benchmarks. The average was calculated over six datasets.
  • ...and 5 more figures