ChocoLlama: Lessons Learned From Teaching Llamas Dutch

Matthieu Meeus, Anthony Rathé, François Remy, Pieter Delobelle, Jens-Joris Decorte, Thomas Demeester

TL;DR

ChocoLlama investigates adapting English-dominated LLMs to Dutch through LoRA-based continued pretraining, Dutch tokenizer reinitialization, and posttraining. It demonstrates that LoRA scales for language adaptation in Llama-2 and that a Dutch tokenizer with embedding reinitialization can yield gains, though Llama-3's strong multilingual pretraining often reduces the impact of additional Dutch pretraining. A new Dutch benchmark, ChocoLlama-Bench, and a qualitative evaluation framework reveal that high-quality instruction tuning and domain-focused posttraining drive performance in multilingual LLMs, sometimes surpassing language-adapted baselines. The work highlights the need for robust, language-specific benchmarks and suggests shifting emphasis from broad continued pretraining to targeted posttraining for advancing Dutch LLM capabilities in the era of multilingual foundation models.

Abstract

While Large Language Models (LLMs) have shown remarkable capabilities in natural language understanding and generation, their performance often lags in lower-resource, non-English languages due to biases in the training data. In this work, we explore strategies for adapting the primarily English LLMs (Llama-2 and Llama-3) to Dutch, a language spoken by 30 million people worldwide yet often underrepresented in LLM development. We collect 104GB of Dutch text (32B tokens) from various sources to first apply continued pretraining using low-rank adaptation (LoRA), complemented with Dutch posttraining strategies provided by prior work. For Llama-2, we consider (i) using the tokenizer of the original model, and (ii) training a new, Dutch-specific tokenizer combined with embedding reinitialization. We evaluate our adapted models, ChocoLlama-2, both on standard benchmarks and on a novel Dutch benchmark, ChocoLlama-Bench. Our results demonstrate that LoRA can effectively scale for language adaptation, and that tokenizer modification with careful weight reinitialization can improve performance. Notably, Llama-3 was released during the course of this project and, upon evaluation, demonstrated superior Dutch capabilities compared to our Dutch-adapted versions of Llama-2. We hence apply the same adaptation technique to Llama-3, using its original tokenizer. While our adaptation methods enhanced Llama-2's Dutch capabilities, we found limited gains when applying the same techniques to Llama-3. This suggests that for ever-improving multilingual foundation models, language adaptation techniques may benefit more from focusing on language-specific posttraining rather than on continued pretraining. We hope this work contributes to the broader understanding of adapting LLMs to lower-resource languages, and to the development of Dutch LLMs in particular.
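
For readers unfamiliar with LoRA, the following is a minimal NumPy sketch of the general low-rank update it relies on, not the authors' training code. The layer sizes, rank, scaling factor `alpha`, and the helper name `lora_forward` are illustrative assumptions; the paper's actual hyperparameters are not stated in the abstract.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha):
    """Apply a frozen linear layer W plus a scaled low-rank update B @ A.

    Hypothetical helper for illustration only.
    x: (batch, d_in), W: (d_out, d_in), A: (r, d_in), B: (d_out, r).
    """
    r = A.shape[0]
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

rng = np.random.default_rng(0)
d_in, d_out, r = 16, 8, 2          # toy sizes, assumed for the example
alpha = 4.0                        # assumed scaling factor

W = rng.normal(size=(d_out, d_in))        # pretrained weight, kept frozen
A = rng.normal(size=(r, d_in)) * 0.01     # trainable, small random init
B = np.zeros((d_out, r))                  # trainable, zero init

x = rng.normal(size=(4, d_in))
y = lora_forward(x, W, A, B, alpha)

# With B initialized to zero, the adapted layer starts out identical
# to the frozen pretrained layer.
assert np.allclose(y, x @ W.T)

# Only A and B are trained, so the trainable parameter count is
# r * (d_in + d_out) instead of d_in * d_out.
trainable = A.size + B.size
full = W.size

# After training, the update can be merged back into a single matrix.
B_trained = rng.normal(size=(d_out, r))   # stand-in for learned values
W_merged = W + (alpha / r) * B_trained @ A
y_adapter = lora_forward(x, W, A, B_trained, alpha)
y_merged = x @ W_merged.T
```

Because only `A` and `B` receive gradients, the number of trained parameters grows with the rank `r` rather than with the full weight matrix, which is what makes continued pretraining on a large corpus comparatively cheap.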

Paper Structure

This paper contains 20 sections, 1 equation, 2 figures, 7 tables.

Figures (2)

  • Figure 1: Hand-picked conversational snippet of Llama-3-ChocoLlama-instruct. Answers to the same prompt for other models considered in this paper can be found in Appendix \ref{app:samples}.
  • Figure 2: Model perplexity $\mathcal{P}_{\theta}(D_{\text{batch}})$ (left) and normalized perplexity $\mathcal{P}_{\theta}^{\text{norm}}(D_{\text{batch}})$ (see Equation \ref{eq_norm_perpl}) (right) across the ChocoLlama model suite during pretraining over 1 full epoch of the collected Dutch data.