Data Darwinism Part I: Unlocking the Value of Scientific Data for Pre-training

Yiwei Qin; Zhen Huang; Tiantian Mi; Weiye Si; Chenyang Zhou; Qipeng Guo; Siyuan Feng; Pengfei Liu

TL;DR

This paper introduces Data Darwinism, a ten-level framework (L0-L9) that treats data processing as co-evolving with model capabilities, shifting the focus from static data quality to data-model synergy. By constructing Darwin-Science, a 900B-token science-focused corpus, together with contamination-free baselines, it identifies a learnability gap in raw scientific text and shows that advancing to L4 (Generative Refinement) and L5 (Cognitive Completion) yields scalable gains, especially on domain-aligned evaluations. In a controlled 600B-token continued pre-training run, higher-level data processing consistently improves performance; the larger model gains more, and domain-aligned benchmarks show larger gains than general ones. The work also offers guidelines on composition ratios, the importance of teacher-model quality for cognitive completion, and the benefits of longer context, and it releases the Darwin-Science corpus and daVinci-origin models to support principled, co-evolutionary development of scientific AI systems.

Abstract

Data quality determines foundation model performance, yet systematic processing frameworks are lacking. We introduce Data Darwinism, a ten-level taxonomy (L0-L9) that conceptualizes data-model co-evolution: advanced models produce superior data for next-generation systems. We validate this on scientific literature by constructing Darwin-Science, a 900B-token corpus (L0-L5). We identify a learnability gap in raw scientific text, which we bridge via L4 (Generative Refinement) and L5 (Cognitive Completion) using frontier LLMs to explicate reasoning and terminology. To ensure rigorous attribution, we pre-trained daVinci-origin-3B/7B models from scratch, excluding scientific content to create contamination-free baselines. After 600B tokens of continued pre-training, Darwin-Science outperforms baselines by +2.12 (3B) and +2.95 (7B) points across 20+ benchmarks, rising to +5.60 and +8.40 points on domain-aligned tasks. Systematic progression to L5 yields a +1.36 total gain, confirming that higher-level processing unlocks latent data value. We release the Darwin-Science corpus and daVinci-origin models to enable principled, co-evolutionary development.
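As a reading aid, the ten-level taxonomy can be written down as an ordered enumeration. The sketch below is illustrative only and is not taken from the paper's released code: the level names follow the section headings of the paper, while the class name `ProcessingLevel`, the `label` property, and the helper `levels_up_to` are hypothetical names introduced here.

```python
from enum import IntEnum


class ProcessingLevel(IntEnum):
    """Hypothetical encoding of the L0-L9 hierarchy; names follow the paper's section titles."""

    DATA_ACQUISITION = 0
    FORMAT_NORMALIZATION = 1
    RULE_BASED_FILTERING = 2
    LIGHTWEIGHT_MODEL_FILTERING = 3
    GENERATIVE_REFINEMENT = 4
    COGNITIVE_COMPLETION = 5
    CONTEXTUAL_COMPLETION = 6
    ENVIRONMENT_SYNTHESIS = 7
    ECOSYSTEM_SYNTHESIS = 8
    WORLD_SYNTHESIS = 9

    @property
    def label(self) -> str:
        # Short tag in the paper's notation, e.g. "L4".
        return f"L{self.value}"


def levels_up_to(top: ProcessingLevel) -> list[ProcessingLevel]:
    """Return every level from L0 through `top`, in ascending order."""
    return [level for level in ProcessingLevel if level <= top]


# The abstract states that Darwin-Science covers L0 through L5.
DARWIN_SCIENCE_LEVELS = levels_up_to(ProcessingLevel.COGNITIVE_COMPLETION)

if __name__ == "__main__":
    for level in DARWIN_SCIENCE_LEVELS:
        print(level.label, level.name.replace("_", " ").title())
```

Using an ordered integer enum keeps the levels comparable, which matches the abstract's framing of a progression from lower to higher processing levels.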

Paper Structure (96 sections, 15 equations, 8 figures, 6 tables)

Introduction
Data Processing Hierarchy in Data Darwinism
L0: Data Acquisition Level
L1: Format Normalization Level
L2: Rule-based Filtering Level
L3: Lightweight Model Filtering Level
L4: Generative Refinement Level
L5: Cognitive Completion Level
L6: Contextual Completion Level
L7: Environment Synthesis Level
L8: Ecosystem Synthesis Level
L9: World Synthesis Level
Dataset Construction
L0: Data Acquisition
Publicly accessible resources
...and 81 more sections

Figures (8)

Figure 1: The Data Darwinism Pipeline. An evolutionary trajectory of data processing, illustrating the transition from raw data acquisition through model-driven refinement to the final stage of synthesized world generation.
Figure 2: Overview of Data Processing Hierarchy in Data Darwinism
Figure 3: Overview of the dataset construction pipeline.
Figure 4: Construction pipeline of our benchmark
Figure 5: Performance gains of daVinci-origin-3B and daVinci-origin-7B models. In both plots, the y-axis denotes the relative improvement over the corresponding base models.
...and 3 more figures
