Language Adaptation on a Tight Academic Compute Budget: Tokenizer Swapping Works and Pure bfloat16 Is Enough

Konstantin Dobler; Gerard de Melo

TL;DR

The paper evaluates language adaptation of Mistral-7B under tight academic compute budgets, focusing on German and Arabic. It finds that training in pure bfloat16 delivers substantial efficiency gains and remains viable, though RMSNorm weights can behave differently due to numeric precision. Tokenizer swapping, paired with good embedding reinitialization, yields efficient tokenization and comparable downstream performance in German, while helping Arabic more noticeably in hindsight experiments. The results suggest language adaptation is not universally beneficial, being more advantageous for underrepresented languages, and offer practical guidance for future budget-conscious adaptation efforts. Overall, the study highlights pure bfloat16 and tokenizer strategies as key levers for efficient language specialization within constrained resources.

Abstract

We investigate continued pretraining of LLMs for language adaptation on a tight academic budget: a setting in which only a few GPUs can be used in parallel, for a heavily constrained duration. We focus on adapting Mistral-7B to German or Arabic and evaluate several techniques to improve efficiency and effectiveness in this setting. Our German models adapted on this tight compute budget underperform compared to the base Mistral-7B, while our Arabic models outperform several baselines, showing that for sufficiently well-represented languages, continued pretraining for specialization is not always helpful. Our main findings focus on training precision and tokenizer swapping. Our results show that pure bfloat16 training is a viable alternative to mixed-precision training, while being much faster when only using a few GPUs. Swapping the tokenizer for a specialized one yields more efficient tokenization and is competitive with the original tokenizer, which already contains some German tokens, but did not significantly increase performance for German. Code and model weights are available on GitHub.
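The precision trade-off behind pure bfloat16 training can be illustrated with a small sketch. It assumes bfloat16 is modeled as the upper 16 bits of a float32 value with round-to-nearest-even, and it ignores NaN, infinity, and overflow handling. The helper names `to_bfloat16` and `apply_update` are hypothetical and do not come from the paper's code. The sketch shows why weights with a large magnitude (for example values near 1.0) can fail to register a small update that a much smaller weight would still absorb.

```python
import struct


def to_bfloat16(x: float) -> float:
    """Round x to the nearest bfloat16 value and return it as a Python float.

    Assumption: bfloat16 keeps the sign bit, the 8 exponent bits, and the top
    7 mantissa bits of float32. NaN, infinity, and overflow are not handled.
    """
    # Reinterpret the float32 encoding of x as a 32-bit unsigned integer.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round-to-nearest-even on the 16 bits that bfloat16 discards.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + rounding_bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def apply_update(weight: float, update: float) -> float:
    """Add an update to a weight that is stored in bfloat16 (hypothetical helper)."""
    return to_bfloat16(to_bfloat16(weight) + update)


if __name__ == "__main__":
    # Near 1.0 the gap between adjacent bfloat16 values is 2**-7 (about 0.0078),
    # so an update of 0.001 is rounded away.
    print(apply_update(1.0, 1e-3))
    # Near 0.01 the gap is 2**-14 (about 0.00006), so an update of 0.0001 survives.
    print(to_bfloat16(0.01), apply_update(0.01, 1e-4))
```

In mixed-precision training a float32 master copy of the weights usually absorbs such small updates, which is the safeguard that pure bfloat16 training gives up in exchange for lower memory use and higher speed.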

Paper Structure (33 sections, 1 equation, 3 figures, 11 tables)

Introduction
Background: Precision Types
The Numerics of bfloat16.
Large weights lead to problems when training in pure bfloat16.
Case study: Mistral-7B.
Experimental Setup
Main Experiments
Hindsight Study
Results & Analysis
Analysis: Pure bfloat16 vs. Mixed Precision
Training efficiency gains.
Loss and downstream task results.
The effects of pure bfloat16 numerics.
Analysis: Tokenizer Swapping
On comparing loss between different tokenizers.
...and 18 more sections

Figures (3)

Figure 1: Illustration of the memory layout of float32 and bfloat16, based on fpfigure.
Figure 2: Histogram of the absolute values of individual parameter weights of Mistral-7B, separately highlighting RMSNorm and non-RMSNorm weights.
Figure 3: Token-normalized negative log-likelihood (conventional cross-entropy loss) of a held-out test set throughout continued pretraining of Mistral-7B on German text. We compare pure and mixed-precision bfloat16 training. Additionally, we compare swapping the original tokenizer of Mistral-7B with a specialized German tokenizer.
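The caption of Figure 3 reports a token-normalized loss while also comparing two different tokenizers. The sketch below illustrates why such a comparison needs care: the same total negative log-likelihood over the same text gives different per-token values when the token counts differ. Normalizing by UTF-8 bytes is shown as one tokenizer-independent alternative; this is an assumption for illustration, not necessarily the normalization the authors use. All function names and numbers are hypothetical.

```python
def token_normalized_nll(total_nll: float, num_tokens: int) -> float:
    """Average negative log-likelihood per token (conventional cross-entropy)."""
    return total_nll / num_tokens


def byte_normalized_nll(total_nll: float, text: str) -> float:
    """Average negative log-likelihood per UTF-8 byte of the underlying text."""
    return total_nll / len(text.encode("utf-8"))


if __name__ == "__main__":
    text = "Straßenbahnhaltestelle"
    total_nll = 12.0  # hypothetical total NLL, identical for both tokenizers

    # Hypothetical token counts: a general tokenizer splits the word into
    # 6 pieces, a specialized German tokenizer into 3.
    print(token_normalized_nll(total_nll, 6))  # 2.0
    print(token_normalized_nll(total_nll, 3))  # 4.0

    # The byte-normalized value depends only on the text, not on the tokenizer.
    print(byte_normalized_nll(total_nll, text))
```

With fewer tokens per text, the more efficient tokenizer shows a higher per-token loss even though the model assigns the text the same total probability.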
