Enhancing Multilingual LLM Pretraining with Model-Based Data Selection

Bettina Messmer, Vinko Sabolčec, Martin Jaggi

TL;DR

This work addresses data quality in multilingual LLM pretraining by extending model-based filtering to diverse languages and scripts. It introduces a two-tier approach that builds representative multilingual training sets (MKC and MKC$^+$) and applies both FastText and Transformer embedding-based filters to select knowledge-rich, structured samples from web-scale data. Through extensive experiments with 1B-parameter LLMs across 20 languages, the study demonstrates that the Transformer-based MLP filter trained on MKC$^+$ can match or surpass baselines on multilingual benchmarks using only a fraction of the data (as little as 10%–15%); it also analyzes data contamination, thresholding, and multilingual transfer effects. The findings support the generalizability of model-based multilingual data filtering and culminate in a public release of refined pretraining datasets and code for 20 languages, enabling broader, more efficient multilingual LLM development.

Abstract

Dataset curation has become a basis for strong large language model (LLM) performance. While various rule-based filtering heuristics exist for English and multilingual datasets, model-based filtering techniques have primarily focused on English. To address the disparity stemming from limited research on non-English languages, we propose a model-based filtering framework for multilingual datasets that aims to identify a diverse set of structured and knowledge-rich samples. Our approach emphasizes transparency, simplicity, and efficiency, leveraging Transformer- and FastText-based classifiers to ensure the broad accessibility of our technique and data. We conduct comprehensive ablation studies on the FineWeb-2 web crawl dataset across diverse language families, scripts, and resource availability to demonstrate the effectiveness of our method. Training a 1B-parameter Llama model for 70B and 119B tokens, our approach can match the baseline MMLU score with as little as 15% of the training tokens, while also improving across other benchmarks. These findings provide strong evidence for the generalizability of our approach to other languages. As a result, we extend our framework to 20 languages for which we release the refined pretraining datasets.
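The retention-based selection described above can be illustrated with a minimal sketch. It assumes each document already carries a quality score from some classifier (in the paper, a FastText or Transformer embedding-based model); the scores, the function name `retain_top_fraction`, and the example documents are hypothetical and are not taken from the released code.

```python
from typing import List, Tuple


def retain_top_fraction(
    scored_docs: List[Tuple[str, float]], fraction: float
) -> List[Tuple[str, float]]:
    """Keep the highest-scoring `fraction` of documents.

    `scored_docs` holds (document_id, quality_score) pairs. The scores are
    assumed to come from an upstream quality classifier; here they are
    plain floats supplied by the caller.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    # Number of documents to keep: rounded down, but at least one.
    k = max(1, int(len(scored_docs) * fraction))
    # Rank from highest to lowest score and keep the first k entries.
    ranked = sorted(scored_docs, key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical scores for ten documents.
docs = [
    ("a", 0.20), ("b", 0.90), ("c", 0.50), ("d", 0.10), ("e", 0.70),
    ("f", 0.30), ("g", 0.40), ("h", 0.60), ("i", 0.80), ("j", 0.05),
]
top_10_percent = retain_top_fraction(docs, 0.10)
top_30_percent = retain_top_fraction(docs, 0.30)
```

Ranking and cutting at a fixed fraction is equivalent to thresholding at the corresponding score percentile, which makes retention rates comparable across languages even when raw classifier scores are distributed differently.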

Paper Structure

This paper contains 31 sections, 6 figures, and 20 tables.

Figures (6)

  • Figure 1: Pretraining benchmark performance (average accuracy) measured on Chinese (CMMLU), German (MMLU), and French (MMLU), while training for 119B tokens, comparing the baseline FineWeb-2 dataset against data filtered using our FastText (FT) and Transformer Multi-Layer Perceptron (MLP) embedding-based filtering methods trained on our data mixture MKC$^+$. When using our approaches, the data retention rates are set to 10%.
  • Figure 2: Benchmark performance comparison (accuracy) during training on 119B tokens between the baseline methods (FineWeb, DCLM, FineWeb-Edu, and FineWeb-2) and our proposed filtering methods (FT, MLP, and CS), trained on MKC$^+$. When using our approaches, the data retention rates are set to 10% for English, Chinese, German, and French, 56% for Arabic, and 65% for Danish. For English, Chinese, German, and French, baseline-level performance is observed around 20B tokens consumed (16.7% of the total).
  • Figure 3: Comparison of average document length and standard deviation in FineWeb-2 before and after filtering using one of our approaches, retaining the top 10% of the documents. The average document length of FineWeb-2 is represented as a red horizontal line, while the medians are shown as red dots. Document length is measured as the number of space-separated tokens.
  • Figure 4: Comparison of average document length and standard deviation in FineWeb-2 before and after filtering using one of our approaches, retaining the top 10% of the documents for Chinese and French, 56% for Arabic, and 65% for Danish. The average document length of FineWeb-2 is represented as a red horizontal line, while the medians are shown as red dots. Document length is measured as the number of space-separated tokens.
  • Figure 5: Comparison of average document length and standard deviation in FineWeb-2 before and after filtering using the MLP filtering method, retaining the top 10% of the documents, with different training datasets. The average document length of FineWeb-2 is represented as a red horizontal line, while the medians are shown as red dots. Document length is measured as the number of space-separated tokens.
  • ...and 1 more figures
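
The document-length statistics reported in Figures 3 to 5 (mean, standard deviation, and median over space-separated tokens) can be computed along the following lines. This is a sketch under stated assumptions: the helper name `length_stats` and the sample texts are hypothetical, `str.split()` splits on any whitespace rather than only on the space character, and the population standard deviation is used because the captions do not say which variant the authors report.

```python
import statistics
from typing import Dict, List


def length_stats(documents: List[str]) -> Dict[str, float]:
    """Summarize document lengths measured in whitespace-separated tokens."""
    # str.split() with no argument splits on runs of whitespace.
    lengths = [len(doc.split()) for doc in documents]
    return {
        "mean": statistics.mean(lengths),
        # Population standard deviation; an assumption, see the lead-in.
        "std": statistics.pstdev(lengths),
        "median": statistics.median(lengths),
    }


# Hypothetical sample with lengths 3, 2, and 5 tokens.
sample = ["a b c", "a b", "a b c d e"]
stats = length_stats(sample)
```

Running the same function on the corpus before and after filtering gives the two sets of values that each figure compares.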