Data Darwinism Part I: Unlocking the Value of Scientific Data for Pre-training

Yiwei Qin, Zhen Huang, Tiantian Mi, Weiye Si, Chenyang Zhou, Qipeng Guo, Siyuan Feng, Pengfei Liu

TL;DR

This paper introduces Data Darwinism, a ten-level framework (L0-L9) that treats data processing as co-evolving with model capabilities, shifting the focus from static data quality to data-model synergy. By constructing Darwin-Science, a 900B-token science-focused corpus, together with contamination-free baselines, it identifies a learnability gap in raw scientific text and shows that advancing through L4 (Generative Refinement) and L5 (Cognitive Completion) yields consistent gains, especially on domain-aligned evaluations. In a controlled 600B-token continued pre-training setup, higher-level data processing improves performance, with the larger model gaining more (+2.95 points at 7B versus +2.12 at 3B across 20+ benchmarks) and domain-aligned benchmarks showing larger gains than generic ones (+8.40 and +5.60 points, respectively). The work also provides guidelines on composition ratios, the importance of teacher-model quality for cognitive completion, and the benefits of longer context, and it releases the Darwin-Science corpus and daVinci-origin models to support principled, co-evolutionary development of scientific AI systems.

Abstract

Data quality determines foundation model performance, yet systematic processing frameworks are lacking. We introduce Data Darwinism, a ten-level taxonomy (L0-L9) that conceptualizes data-model co-evolution: advanced models produce superior data for next-generation systems. We validate this on scientific literature by constructing Darwin-Science, a 900B-token corpus (L0-L5). We identify a learnability gap in raw scientific text, which we bridge via L4 (Generative Refinement) and L5 (Cognitive Completion) using frontier LLMs to explicate reasoning and terminology. To ensure rigorous attribution, we pre-trained daVinci-origin-3B/7B models from scratch, excluding scientific content to create contamination-free baselines. After 600B tokens of continued pre-training, Darwin-Science outperforms baselines by +2.12 (3B) and +2.95 (7B) points across 20+ benchmarks, rising to +5.60 and +8.40 points on domain-aligned tasks. Systematic progression to L5 yields a +1.36 total gain, confirming that higher-level processing unlocks latent data value. We release the Darwin-Science corpus and daVinci-origin models to enable principled, co-evolutionary development.

Paper Structure

This paper contains 96 sections, 15 equations, 8 figures, and 6 tables.

Figures (8)

  • Figure 1: The Data Darwinism Pipeline. An evolutionary trajectory of data processing, illustrating the transition from raw data acquisition through model-driven refinement to the final stage of synthesized world generation.
  • Figure 2: Overview of the data processing hierarchy in Data Darwinism.
  • Figure 3: Overview of the dataset construction pipeline.
  • Figure 4: Construction pipeline of our benchmark.
  • Figure 5: Performance gains of daVinci-origin-3B and daVinci-origin-7B models. In both plots, the y-axis denotes the relative improvement over the corresponding base models.
  • ...and 3 more figures