Data Darwinism Part II: DataEvolve -- AI can Autonomously Evolve Pretraining Data Curation

Tiantian Mi, Dongming Shan, Zhen Huang, Yiwei Qin, Muhang Xie, Yuxuan Qiao, Yixiu Liu, Chenyang Zhou, Pengfei Liu

Abstract

Data Darwinism (Part I) established a ten-level hierarchy for data processing, showing that stronger processing can unlock greater data value. However, that work relied on manually designed strategies for a single category. Modern pretraining corpora comprise hundreds of heterogeneous categories spanning domains and content types, each demanding specialized treatment. At this scale, manual strategy design becomes prohibitive. This raises a key question: can strategies evolve in an automated way? We introduce DataEvolve, a framework that enables strategies to evolve through iterative optimization rather than manual design. For each data category, DataEvolve operates in a closed evolutionary loop: it identifies quality issues, generates candidate strategies, executes them on sampled data, evaluates results, and refines approaches across generations. The process accumulates knowledge through an experience pool of discovered issues and a strategy pool tracking performance across iterations. Applied to 8 categories spanning 672B tokens from Nemotron-CC, DataEvolve produces Darwin-CC, a 504B-token dataset with strategies evolved through 30 iterations per category. When 3B models are trained on 500B tokens, Darwin-CC outperforms raw data by 3.96 points and achieves a 44.13 average score across 18 benchmarks, surpassing DCLM, Ultra-FineWeb, and FineWeb-Edu, with strong gains on knowledge-intensive tasks such as MMLU. Analysis shows that the evolved strategies converge on cleaning-focused approaches: targeted noise removal and format normalization with domain-aware preservation, echoing the L4 (Generative Refinement) principles from Part I. Ablation studies confirm that iterative evolution is essential: optimized strategies outperform suboptimal ones by 2.93 points, establishing evolutionary strategy design as both feasible and necessary for pretraining-scale data curation.
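
The closed loop described above (observe issues, design a candidate strategy, clean sampled data, judge the result, and carry knowledge forward in an experience pool and a strategy pool) can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the `NOISE` markers, the substring-stripping "strategy", and the rule-based score are all hypothetical stand-ins chosen so the example runs on its own.

```python
from dataclasses import dataclass

# Hypothetical noise markers standing in for real, category-specific quality issues.
NOISE = ("<nav>", "Click here", "Share this")

@dataclass(frozen=True)
class Strategy:
    remove: tuple  # substrings this toy strategy strips from every document

def observe(samples):
    """Toy data observer: report which noise markers occur in the samples."""
    return {n for n in NOISE if any(n in s for s in samples)}

def design(experience_pool, strategy_pool):
    """Toy strategy designer: extend the best strategy so far with one open issue."""
    best = max(strategy_pool, key=lambda p: p[1])[0] if strategy_pool else Strategy(())
    todo = sorted(experience_pool - set(best.remove))
    return Strategy(best.remove + tuple(todo[:1]))

def clean(strategy, samples):
    """Toy data cleaner: strip the listed substrings and normalize whitespace."""
    cleaned = []
    for doc in samples:
        for marker in strategy.remove:
            doc = doc.replace(marker, "")
        cleaned.append(" ".join(doc.split()))
    return cleaned

def judge(docs):
    """Toy quality judge: fraction of documents free of known noise markers."""
    return sum(not any(n in d for n in NOISE) for d in docs) / len(docs)

def evolve(samples, generations=5):
    experience_pool, strategy_pool = set(), []  # knowledge shared across generations
    for _ in range(generations):
        experience_pool |= observe(samples)
        candidate = design(experience_pool, strategy_pool)
        score = judge(clean(candidate, samples))
        strategy_pool.append((candidate, score))
    return max(strategy_pool, key=lambda p: p[1])

samples = [
    "<nav> Home Photosynthesis converts light into chemical energy.",
    "Click here to subscribe. Cells divide by mitosis.",
    "Plain text with no noise.",
]
best, score = evolve(samples)
print(best.remove, score)
```

In this toy setting each generation addresses one more observed issue, so the score rises until every sampled document is clean. The point is only the control flow: strategies are scored on a small sample, and later generations build on what earlier ones recorded.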

Paper Structure (36 sections, 13 equations, 9 figures, 4 tables)

Figures (9)

  • Figure 1: Performance comparison across pretraining datasets on 18 selected benchmarks.
  • Figure 2: Naively evaluating candidate curation strategies requires cleaning data at full scale and training a model to convergence for each candidate strategy—demanding thousands of GPU hours per evaluation. Across $n$ candidate strategies and $m$ categories, the total cost becomes computationally intractable.
  • Figure 3: Overview of the DataEvolve framework. The system enables strategies to evolve through an iterative feedback loop involving four core components: (1) the data observer identifies category-specific quality issues; (2) the strategy designer generates and refines cleaning strategies; (3) the data cleaner executes strategies on sample data; and (4) the quality judge provides scoring and diagnostic feedback. Discovered issues and evolved strategies are archived in the experience pool and strategy pool, enabling cross-generation knowledge transfer to guide evolutionary progression.
  • Figure 4: Learning curves of 3B models trained on raw data, suboptimal strategy, and optimized strategy (Darwin-CC) over 500B tokens.
  • Figure 5: Learning curves comparison across pretraining corpora. All models are 3B parameters trained for 500B tokens.
  • ...and 4 more figures
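
The cost argument in the Figure 2 caption can be made concrete with a small back-of-the-envelope calculation. The 8 categories and 30 iterations come from the abstract. Treating each iteration as one candidate strategy, and the figure of 1,000 GPU hours per full evaluation (a stand-in for "thousands"), are assumptions made here for illustration only.

```python
def naive_eval_cost(n_strategies, m_categories, gpu_hours_per_eval):
    """GPU hours if every candidate in every category gets a full clean-and-train run."""
    return n_strategies * m_categories * gpu_hours_per_eval

# Assumed inputs: one candidate per iteration, 1,000 GPU hours per evaluation.
total = naive_eval_cost(n_strategies=30, m_categories=8, gpu_hours_per_eval=1_000)
print(total)  # 240000
```

Even under these conservative assumptions the naive approach would need on the order of hundreds of thousands of GPU hours, and the total grows with the product of $n$ and $m$.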