Why Less is More (Sometimes): A Theory of Data Curation

Elvis Dohmatob, Mohammad Pezeshki, Reyhane Askari-Hemmat

TL;DR

This work asks when training on less data yields better generalization, and answers with a principled data-curation framework grounded in high-dimensional theory. It derives exact scaling laws for test error under label-agnostic and label-aware pruning via random-matrix techniques, revealing phase-transition conditions that determine when pruning beats full-data training. The theory connects generator quality, pruner quality, their alignment, and data scale to the optimal pruning strategy. Its predictions are validated on synthetic setups and ImageNet, and it offers insights into LLM reasoning. Practically, the results offer a data-centric lens for stabilizing learning and mitigating model collapse through principled curation, with implications for future data pipelines and self-improvement loops in large models.

Abstract

This paper introduces a theoretical framework to resolve a central paradox in modern machine learning: When is it better to use less data? This question has become critical as classical scaling laws suggesting "more is more" (Sun et al., 2025) are challenged by methods like LIMO ("less is more") and s1 (Ye et al., 2025; Muennighoff et al., 2025), which achieve superior performance with small, aggressively curated datasets. Here, we study data curation strategies where an imperfect oracle selects the training examples according to their difficulty and correctness. Our results provide exact scaling law curves for test error under both label-agnostic and label-aware curation rules, revealing when and why keeping only a subset of data can improve generalization. In contrast to classical scaling laws, we show that under certain conditions, small curated datasets can outperform full datasets, and we provide analytical conditions for this by deriving precise phase transition curves tied to data size and quality. We validate these theoretical claims with empirical results on ImageNet, confirming our predictions about when curation improves accuracy and can even mitigate model collapse. Furthermore, our framework provides a principled explanation for the contradictory curation strategies recently observed in LLM mathematical reasoning.

Paper Structure

This paper contains 71 sections, 21 theorems, 143 equations, 8 figures, 3 tables.

Key Result

Theorem 1

In the limit Eqn. eq:asymptotic, the test error of the model $\hat{w}$ from Eqn. eq:estimator is given by an explicit formula involving $m$, $\tilde{m}$, and $r$, which are functions explicitly determined by the constants in Eqn. eq:constants. In particular, $m$ is the Stieltjes transform of a Marchenko-Pastur law, "deformed" by pruning. Refer to Appendix sec:ingredients for details.
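
For orientation, the sketch below evaluates the classical (undeformed) Marchenko-Pastur Stieltjes transform at $z = -\lambda$ and compares it with the empirical resolvent trace of a sample covariance matrix. It is not the pruning-deformed $m$ of Theorem 1. It assumes isotropic Gaussian features and an aspect ratio $\gamma = d/n$, and the function names `mp_stieltjes` and `empirical_stieltjes` are illustrative, not taken from the paper.

```python
import numpy as np


def mp_stieltjes(lam, gamma):
    """Classical Marchenko-Pastur Stieltjes transform m(-lam), identity covariance.

    m(z) solves gamma*z*m^2 + (z + gamma - 1)*m + 1 = 0; at z = -lam < 0 we
    take the positive root, since m(-lam) is an average of 1/(eig + lam) > 0.
    """
    b = 1.0 - gamma + lam
    return (-b + np.sqrt(b * b + 4.0 * gamma * lam)) / (2.0 * gamma * lam)


def empirical_stieltjes(lam, n, d, seed=0):
    """(1/d) * trace((X^T X / n + lam * I)^{-1}) for an n x d Gaussian matrix X."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    eigs = np.linalg.eigvalsh(X.T @ X / n)
    return float(np.mean(1.0 / (eigs + lam)))


if __name__ == "__main__":
    lam, n, d = 0.1, 800, 400  # assumed values, chosen only for illustration
    print("theory   :", mp_stieltjes(lam, d / n))
    print("empirical:", empirical_stieltjes(lam, n, d))
```

The two numbers should agree closely once $n$ and $d$ are in the hundreds. In the paper's setting the corresponding fixed-point equation is modified by the pruning rule, so this baseline only illustrates the kind of object $m$ is.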

Figures (8)

  • Figure 1: Theory prediction across four key regimes. Test error as a function of the fraction of data kept, $p$ ($p=1$ means keeping all the data), for "keep hard" and "random" pruning. Solid lines are theoretical predictions; dashed lines are empirical results with error bars. The plot reveals that a "more is more" strategy (optimal error at $p=1$) is the default, holding true for small datasets (top row) or a poor generator (right column). The bottom-left quadrant shows the crucial exception: only when data is abundant and the generator is strong does the "less is more" principle apply, with aggressive pruning yielding the lowest error.
  • Figure 2: The optimal curation strategy depends on the data scale in ImageNet. A clear crossover point emerges as we vary the initial dataset size $n$, shifting the optimal strategy from "keep easy" to "keep hard" as the generator model becomes stronger.
  • Figure 3: Strategic pruning prevents model collapse. Over multiple rounds of pseudo-labeling, training on all examples leads to performance degradation. In contrast, selectively training on only hard, valid examples consistently preserves performance across rounds.
  • Figure 4: Validation of theoretical error predictions against empirical simulations. (A) Scatter plot of theory vs. empirical error across 15 configurations, with diagonal = perfect agreement. (B--D) Parameter sweeps for pruning fraction, sample size, and generator angle. (E) Configuration-wise comparisons. All results use logistic regression with $\lambda = 10^{-6}$.
  • Figure 5: Effect of the label-agnostic curation rule (Eqn. eq:non-limo), as proposed in Sorscher et al. (2022).
  • ...and 3 more figures
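
The pruning rules named in the captions above ("keep hard", "keep easy", "random", with kept fraction $p$) can be sketched on synthetic data as follows. This is a minimal illustration under stated assumptions, not the paper's experimental code: features are isotropic Gaussian, the generator labels data with a linear classifier at angle `gen_angle` from the ground truth, the pruner scores difficulty by the margin $|x^\top w_{\text{prune}}|$ (small margin means hard), the pruner is taken to be perfectly aligned with the ground truth, and the learner is ridge regression on the $\pm 1$ labels. All function and parameter names are hypothetical.

```python
import numpy as np


def unit(v):
    """Normalize a vector to unit Euclidean norm."""
    return v / np.linalg.norm(v)


def select(X, w_prune, p, rule, rng):
    """Return indices of the fraction p of rows of X kept under a pruning rule."""
    n = X.shape[0]
    k = max(1, int(round(p * n)))
    if rule == "random":
        return rng.choice(n, size=k, replace=False)
    order = np.argsort(np.abs(X @ w_prune))  # ascending margin: hardest first
    if rule == "keep_hard":
        return order[:k]
    if rule == "keep_easy":
        return order[-k:]
    raise ValueError(f"unknown rule: {rule}")


def ridge_fit(X, y, lam):
    """Ridge estimator: solve (X^T X / n + lam * I) w = X^T y / n."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)


def classification_error(w_hat, w_star, n_eval, rng):
    """Fraction of fresh Gaussian points where sign(x.w_hat) != sign(x.w_star)."""
    Xe = rng.standard_normal((n_eval, len(w_star)))
    return float(np.mean(np.sign(Xe @ w_hat) != np.sign(Xe @ w_star)))


def run(n=2000, d=50, p=0.5, rule="keep_hard", gen_angle=0.3, lam=1e-3, seed=0):
    """Train on a pruned, generator-labeled dataset; return error vs. ground truth."""
    rng = np.random.default_rng(seed)
    w_star = unit(rng.standard_normal(d))
    # Build a generator direction at angle gen_angle (radians) from w_star.
    u = rng.standard_normal(d)
    u = unit(u - (u @ w_star) * w_star)
    w_gen = np.cos(gen_angle) * w_star + np.sin(gen_angle) * u
    w_prune = w_star  # assumption: pruner perfectly aligned with the truth
    X = rng.standard_normal((n, d))
    y = np.sign(X @ w_gen)  # imperfect labels from the generator
    idx = select(X, w_prune, p, rule, rng)
    w_hat = ridge_fit(X[idx], y[idx], lam)
    return classification_error(w_hat, w_star, 5000, rng)


if __name__ == "__main__":
    for rule in ("random", "keep_hard", "keep_easy"):
        errs = [run(p=p, rule=rule) for p in (0.25, 0.5, 1.0)]
        print(rule, [round(e, 3) for e in errs])
```

Sweeping `n`, `gen_angle`, and `p` in a sketch like this is one way to probe the regimes described in Figure 1. Which rule wins in a given regime is what the paper's theory characterizes, and the sketch makes no claim about that outcome.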

Theorems & Definitions (29)

  • Remark 1
  • Theorem 1: Exact Test Error
  • Theorem 2: Optimal Pruning Strategy
  • Theorem 3: Test Error for Label-aware Curation
  • Theorem 4
  • Corollary 1
  • Corollary 2
  • Theorem 5
  • Definition 1: Deterministic Equivalents
  • Proposition 1
  • ...and 19 more