Data Mixing Can Induce Phase Transitions in Knowledge Acquisition

Xinran Gu, Kaifeng Lyu, Jiazheng Li, Jingzhao Zhang

TL;DR

The paper uncovers phase transitions in knowledge acquisition when knowledge-dense data are mixed with web-scale data, showing that memorization behavior can shift abruptly with model size or mixing ratio. It develops an information-theoretic, capacity-allocation framework (knapsack-like) that predicts when knowledge-dense data become valuable to learn, including a power-law relationship between the critical mixing ratio and model size. Through controlled experiments on synthetic SynBio biographies and real-world WikiBio data, it demonstrates sharp transitions and proposes practical mitigation strategies—Random Subsampling and Compact Knowledge Mixing (CKM)—that substantially improve knowledge uptake while preserving general capabilities. The work highlights the importance of data-mixing design, especially for smaller models, and offers actionable guidance for efficient pre-training and continual learning in mixed-data regimes.

Abstract

Large Language Models (LLMs) are typically trained on data mixtures: most data come from web scrapes, while a small portion is curated from high-quality sources with dense domain-specific knowledge. In this paper, we show that when training LLMs on such data mixtures, knowledge acquisition from knowledge-dense datasets, unlike training exclusively on knowledge-dense data (arXiv:2404.05405), does not always follow a smooth scaling law but can exhibit phase transitions with respect to the mixing ratio and model size. Through controlled experiments on a synthetic biography dataset mixed with web-scraped data, we demonstrate that: (1) as we increase the model size to a critical value, the model suddenly transitions from memorizing very few to most of the biographies; (2) below a critical mixing ratio, the model memorizes almost nothing even with extensive training, but beyond this threshold, it rapidly memorizes more biographies. We attribute these phase transitions to a capacity allocation phenomenon: a model with bounded capacity must act like a knapsack problem solver to minimize the overall test loss, and the optimal allocation across datasets can change discontinuously as the model size or mixing ratio varies. We formalize this intuition in an information-theoretic framework and reveal that these phase transitions are predictable, with the critical mixing ratio following a power-law relationship with the model size. Our findings highlight a concrete case where a good mixing recipe for large models may not be optimal for small models, and vice versa.
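The capacity-allocation intuition can be illustrated with a toy calculation. The sketch below is not the paper's formal framework: it assumes a hypothetical web-data loss of the form `A * c**(-alpha)` for capacity `c`, a knowledge loss of `max(H_tot - m, 0)` for capacity `m` spent on facts, and a mixture loss weighted by the mixing ratio `r`. All function names and constants are illustrative assumptions.

```python
# Toy capacity-allocation model (illustrative assumptions, not the paper's exact setup).
# A learner with total capacity M spends m on facts and M - m on web data, minimizing
#   loss(m) = (1 - r) * A * (M - m) ** (-alpha) + r * max(H_tot - m, 0)
# where r is the mixing ratio and H_tot the total information in the facts.


def critical_ratio(M, A=1.0, alpha=0.5):
    """Mixing ratio below which the optimal allocation to facts is zero."""
    # Marginal web-loss cost of giving up one unit of capacity at m = 0.
    g = A * alpha * M ** (-(alpha + 1.0))
    # Solve r / (1 - r) = g for r.
    return g / (1.0 + g)


def optimal_allocation(M, r, H_tot, A=1.0, alpha=0.5):
    """Capacity spent on facts at the minimum of the toy loss, for 0 < r < 1."""
    if not 0.0 < r < 1.0:
        raise ValueError("r must lie strictly between 0 and 1")
    # First-order condition: (1 - r) * A * alpha * (M - m) ** (-alpha - 1) = r.
    web_capacity = ((1.0 - r) * A * alpha / r) ** (1.0 / (alpha + 1.0))
    # The loss is convex in m on [0, H_tot], so clip the stationary point.
    return min(max(M - web_capacity, 0.0), H_tot)


if __name__ == "__main__":
    for M in (10.0, 100.0, 1000.0):
        print(M, critical_ratio(M), optimal_allocation(M, 0.01, H_tot=50.0))
```

In this toy, the allocation to facts stays at zero until `r` exceeds `critical_ratio(M)`, and the odds `r / (1 - r)` at that threshold scale as `M ** (-(alpha + 1))`, a power law in model size. The exponent follows from the assumed web-loss form and is not a value taken from the paper.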

Paper Structure

This paper contains 89 sections, 8 theorems, 28 equations, 14 figures, 14 tables, 1 algorithm.

Key Result

Theorem 4.2

For all $M \ge 0$, if all the facts have the same exposure frequency $p$, then [equation missing from this listing], where $H_{\mathrm{tot}} := \sum_{i=1}^{K} H(\mathcal{Y}_i)$ and $C := F_{\mathcal{P}}(\infty)$.

Figures (14)

  • Figure 1: Phase transition in model size. For each mixing ratio, as model size increases, accuracy initially remains zero. Once model size surpasses some threshold, accuracy rapidly grows to over 60%.
  • Figure 2: Phase transition in mixing ratio. For each model size, as mixing ratio $r$ increases, accuracy initially remains zero. Only when $r$ exceeds some threshold does accuracy quickly improve.
  • Figure 3: Training longer barely helps at low mixing ratios: the number of training steps required to reach a target accuracy grows exponentially or even superexponentially with $1/r$. We train 70M models on the mixture of FineWeb-Edu and SynBio-320k with $r$ ranging from 0.2 to 0.8.
  • Figure 4: For 410M models trained on FineWeb-Edu + SynBio-1.28M, accuracy for $r=0.2$ remains near zero even with 4x more training.
  • Figure 5: Similar phase transitions for the slope calculation subtask persist when we mix the modified OpenWebMath with FineWeb-Edu. The model size for (b) is 70M.
  • ...and 9 more figures

Theorems & Definitions (18)

  • Definition 4.1: Optimal Bounded-Capacity Learner
  • Theorem 4.2
  • Theorem 4.3: Phase Transition in Model Size
  • Lemma F.1
  • proof
  • Definition F.2: Factual Data Universe
  • Theorem F.3: Theorem 4.2, restated
  • proof
  • Definition F.4: Mixture of Data Universes
  • Definition F.5: Orthogonal Mixture of Data Universes
  • ...and 8 more