Data Science and Technology Towards AGI Part I: Tiered Data Management

Yudong Wang, Zixuan Fu, Hengyu Zhao, Chen Zhao, Chuyue Zhou, Xinle Lin, Hongya Lyu, Shuaikang Xue, Yi Yi, Yingjiao Wang, Zhi Zheng, Yuzhou Zhang, Jie Zhou, Chaojun Xiao, Xu Han, Zhiyuan Liu, Maosong Sun

TL;DR

The paper argues that sustaining progress toward artificial general intelligence requires moving from unidirectional data scaling to a Data-Model Co-Evolution paradigm. It introduces a five-level L0–L4 tiered data management framework that aligns data quality, cost, and training objectives across pre-training, mid-training, and alignment stages, leveraging LLMs for data scoring, editing, and synthesis. Empirical studies across English and Chinese web data, mathematics, and code demonstrate that tiered data utilization improves training efficiency and model performance, with cross-domain benefits and scalable gains at higher tiers. A dedicated UltraData-Math case study confirms that higher-tier data yields transferable improvements and mitigates late-stage convergence issues. The authors also release tiered datasets and tools to foster community adoption, highlighting data management as a fundamental engineering pillar for scalable, sustainable AGI development.

Abstract

The development of artificial intelligence can be viewed as an evolution of data-driven learning paradigms, with successive shifts in data organization and utilization continuously driving advances in model capability. Current LLM research is dominated by a paradigm that relies heavily on unidirectional scaling of data size, increasingly encountering bottlenecks in data availability, acquisition cost, and training efficiency. In this work, we argue that the development of AGI is entering a new phase of data-model co-evolution, in which models actively guide data management while high-quality data, in turn, amplifies model capabilities. To implement this vision, we propose a tiered data management framework, designed to support the full LLM training lifecycle across heterogeneous learning objectives and cost constraints. Specifically, we introduce an L0-L4 tiered data management framework, ranging from raw uncurated resources to organized and verifiable knowledge. Importantly, LLMs are used throughout the data management process, for tasks such as quality scoring and content editing, to refine data across tiers. Each tier is characterized by distinct data properties, management strategies, and training roles, enabling data to be strategically allocated across LLM training stages, including pre-training, mid-training, and alignment. The framework balances data quality, acquisition cost, and marginal training benefit, providing a systematic approach to scalable and sustainable data management. We validate the effectiveness of the proposed framework through empirical studies, in which tiered datasets are constructed from raw corpora and used across multiple training phases. Experimental results demonstrate that tier-aware data utilization significantly improves training efficiency and model performance. To facilitate further research, we release our tiered datasets and processing tools to the community.

Paper Structure (17 sections, 3 figures, 7 tables)

Figures (3)

  • Figure 1: Paradigm Shift in Data Organization and Utilization. The evolution of LMs, and ultimately the progression toward AGI, fundamentally represents a paradigm shift in the organization and utilization of data, progressing through Symbolic, Supervised, Self-supervised, and Feedback Learning phases. We argue that the field is transitioning toward a Data-Model Co-Learning phase, which necessitates three critical research pillars (Scientific Data Value Assessment, Hierarchical Data Management, and Dynamic Data-Model Co-evolution) to transcend current sustainability bottlenecks.
  • Figure 2: Tiered data management framework. This framework establishes a fine-grained standard centered on data quality and trustworthiness by defining five distinct stages. The framework sequentially elevates data purity through specialized operators while managing the progression from raw data to high-density knowledge assets and their associated computational costs.
  • Figure 3: Comparison of average score between mix training and tiered training at each checkpoint. The tiered training strategy (40B tokens per stage) consistently outperforms the mix training baseline across most evaluation intervals.
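
The contrast in Figure 3 between tiered training and mix training can be illustrated with a minimal sketch. Only the 40B-tokens-per-stage budget comes from the caption. The number of stages, the tier assigned to each stage (`HYPOTHETICAL_STAGE_TIERS`), the uniform mixture used for the baseline, and both function names are assumptions made for illustration; they are not the authors' configuration.

```python
# 40B tokens per stage, as stated in the Figure 3 caption.
TOKENS_PER_STAGE = 40_000_000_000

# Assumption: three stages that move from lower to higher tiers.
# The actual tiers and their order are not given in this summary.
HYPOTHETICAL_STAGE_TIERS = ["L1", "L2", "L3"]


def tiered_schedule(tokens_seen, stage_tiers=HYPOTHETICAL_STAGE_TIERS,
                    tokens_per_stage=TOKENS_PER_STAGE):
    """Return the tier to sample from after `tokens_seen` training tokens."""
    if tokens_seen < 0:
        raise ValueError("tokens_seen must be non-negative")
    # Integer division picks the stage; the last stage absorbs any overflow.
    stage = min(tokens_seen // tokens_per_stage, len(stage_tiers) - 1)
    return stage_tiers[stage]


def mixed_schedule(stage_tiers=HYPOTHETICAL_STAGE_TIERS):
    """Baseline sketch: the same tiers, mixed uniformly at every step."""
    weight = 1.0 / len(stage_tiers)
    return {tier: weight for tier in stage_tiers}


if __name__ == "__main__":
    for tokens in (0, 40_000_000_000, 100_000_000_000):
        print(tokens, tiered_schedule(tokens))
    print(mixed_schedule())
```

In this sketch the tiered schedule exposes the model to one tier per stage, while the baseline draws from all tiers throughout; real pipelines may blend tiers within a stage.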