A Diversity Diet for a Healthier Model: A Case Study of French ModernBERT

Louis Estève, Christophe Servan, Thomas Lavergne, Agata Savary

TL;DR

This study finds that, on some tasks, diversity-driven sampling gains 10 points over randomly sampled pre-training data of comparable size. It also compares diversity-driven sampling algorithms in order to select the best one.

Abstract

Diversity has been gaining interest in the NLP community in recent years. At the same time, state-of-the-art transformer models such as ModernBERT use very large pre-training datasets, which are driven by size rather than by diversity. This calls for an investigation of the impact of diversity on ModernBERT pre-training. We carry out this investigation with the express intent of reducing pre-training dataset size while retaining at least comparable performance. We compare diversity-driven sampling algorithms in order to select the best one. We find that, on some tasks, diversity-driven sampling gains 10 points over randomly sampled pre-training data of comparable size. We also find that a model pre-trained for 483h on a diversity-driven dataset of 150M tokens can perform comparably to a model pre-trained for 1,775h on a randomly sampled dataset of 2.4B tokens.

Paper Structure (20 sections, 4 equations, 3 figures, 4 tables, 1 algorithm)

This paper contains 20 sections, 4 equations, 3 figures, 4 tables, 1 algorithm.

Figures (3)

  • Figure 1: Comparison of sampling algorithms run on UD v2.16 (French). lee_non-monotone_2009 have the best entropy, but scholivet-etal-2025-selexini is just below, with only a third in $n'$. For scholivet-etal-2025-selexini, the last value of $E$ is $20$, with each preceding value increasing by $10$ (the only exception is at $\left\vert E \right\vert = 1$ where the only value is $1$).
  • Figure 2: Pre-training process for encoders. Loss is smoothed using a moving average (window size: $2 \times 10^5$, series sizes from $1.5$M to $6.3$M points) applied thrice.
  • Figure 3: Fine-tuning (head only) for MEDIA (full). Dashed thick lines delimit dataset traversals. Thick crosses denote maximum value.
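The Figure 2 caption describes smoothing the loss with a moving average applied three times. The sketch below illustrates that idea on a toy series. It is not the authors' code: the function names `moving_average` and `smooth` are hypothetical, the average is assumed to be a simple unweighted one, and boundaries are assumed to be handled by keeping only full windows, since the caption specifies neither.

```python
from itertools import accumulate


def moving_average(values, window):
    """Unweighted moving average over full windows only (assumed boundary rule)."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    # Prefix sums let each window mean be computed in constant time.
    prefix = [0.0] + list(accumulate(values))
    return [
        (prefix[i + window] - prefix[i]) / window
        for i in range(len(values) - window + 1)
    ]


def smooth(values, window, passes=3):
    """Apply the moving average `passes` times in a row (three, as in the caption)."""
    out = [float(v) for v in values]
    for _ in range(passes):
        out = moving_average(out, window)
    return out


# Toy loss series; the caption's window is 2e5 on series of 1.5M to 6.3M points.
loss = [10, 8, 9, 7, 6, 7, 5, 4, 5, 3]
smoothed = smooth(loss, window=2, passes=3)
```

Under the full-window assumption, each pass shortens the series by `window - 1` points, so three passes with the caption's window would remove roughly $6 \times 10^5$ points in total. Repeating an unweighted average also weights the centre of the effective window more heavily than a single pass does, which gives a smoother curve.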