Why Less is More (Sometimes): A Theory of Data Curation

Elvis Dohmatob; Mohammad Pezeshki; Reyhane Askari-Hemmat

TL;DR

This work asks when training on less data can yield better generalization, and answers it with a principled data-curation framework grounded in high-dimensional theory. Using random-matrix techniques, it derives exact scaling laws for test error under label-agnostic and label-aware pruning, and identifies the phase-transition conditions that determine when pruning beats full-data training. The theory links generator quality, pruner quality, their alignment, and data scale to the optimal pruning strategy. Its predictions are validated on synthetic setups and ImageNet, and it offers insight into curation for LLM reasoning. Practically, the results give a data-centric lens for stabilizing learning and mitigating model collapse through principled curation, with implications for future data pipelines and self-improvement loops in large models.

Abstract

This paper introduces a theoretical framework to resolve a central paradox in modern machine learning: When is it better to use less data? This question has become critical as classical scaling laws suggesting "more is more" (Sun et al., 2025) are challenged by methods like LIMO ("less is more") and s1 (Ye et al., 2025; Muennighoff et al., 2025), which achieve superior performance with small, aggressively curated datasets. Here, we study data curation strategies in which an imperfect oracle selects training examples according to their difficulty and correctness. Our results provide exact scaling law curves for test error under both label-agnostic and label-aware curation rules, revealing when and why keeping only a subset of the data can improve generalization. In contrast to classical scaling laws, we show that under certain conditions small curated datasets can outperform full datasets, and we give analytical conditions for this by deriving precise phase transition curves tied to data size and quality. We validate these theoretical claims with empirical results on ImageNet, confirming our predictions about when curation improves accuracy and showing that it can even mitigate model collapse. Furthermore, our framework provides a principled explanation for the contradictory curation strategies recently observed in LLM mathematical reasoning.
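To make the two kinds of curation rule concrete, here is a minimal toy sketch in plain Python. It is not the paper's construction: the Gaussian inputs, the linear "generator" and "pruner" directions, the margin-based notion of difficulty, and every function name below are assumptions chosen for illustration only. A label-agnostic rule ranks examples by the pruner's margin without looking at labels, while a label-aware rule keeps only examples whose label agrees with the pruner's own prediction.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sign(t):
    return 1 if t >= 0 else -1

def make_data(n, d, w_gen, seed=0):
    """Gaussian inputs labeled by a (possibly imperfect) generator direction (assumed setup)."""
    rng = random.Random(seed)
    xs = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    ys = [sign(dot(x, w_gen)) for x in xs]
    return xs, ys

def prune_label_agnostic(xs, w_prune, keep_frac, keep_hard=True):
    """Keep a fixed fraction ranked by |<x, w_prune>|; labels are never consulted.

    A small margin is treated as "hard" and a large margin as "easy" (an assumption).
    """
    order = sorted(range(len(xs)),
                   key=lambda i: abs(dot(xs[i], w_prune)),
                   reverse=not keep_hard)
    k = max(1, int(keep_frac * len(xs)))
    return sorted(order[:k])

def prune_label_aware(xs, ys, w_prune):
    """Keep only examples whose label agrees with the pruner's prediction."""
    return [i for i, (x, y) in enumerate(zip(xs, ys))
            if sign(dot(x, w_prune)) == y]

# Hypothetical generator and pruner directions, imperfectly aligned with each other.
w_gen = [1.0, 0.2, 0.0, 0.0, 0.0]
w_prune = [1.0, 0.0, 0.1, 0.0, 0.0]
xs, ys = make_data(200, 5, w_gen, seed=0)

hard = prune_label_agnostic(xs, w_prune, keep_frac=0.25, keep_hard=True)
easy = prune_label_agnostic(xs, w_prune, keep_frac=0.25, keep_hard=False)
agree = prune_label_aware(xs, ys, w_prune)
print(len(hard), len(easy), len(agree))
```

Which subset is the better training set, if any, is the question the abstract says the theory answers through its phase transition curves; this sketch only shows how the subsets are selected.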
