BaichuanSEED: Sharing the Potential of ExtensivE Data Collection and Deduplication by Introducing a Competitive Large Language Model Baseline

Guosheng Dong; Da Pan; Yiding Sun; Shusen Zhang; Zheng Liang; Xin Wu; Yanjun Shen; Fan Yang; Haoze Sun; Tianpeng Li; Mingan Lin; Jianhua Xu; Yufan Zhang; Xiaonan Nie; Lei Su; Bingning Wang; Wentao Zhang; Jiaxin Mao; Zenan Zhou; Weipeng Chen

TL;DR

BaichuanSEED addresses the lack of transparency in LLM pretraining data by open-sourcing the details of a data processing pipeline and training a pure 7B baseline on 3T tokens. It shows that broad collection and global deduplication with reweighting can yield competitive results without task-specific optimization. Evaluated against open and commercial baselines on comprehensive benchmarks, the model is competitive, with notable strengths on Chinese knowledge benchmarks and in generalization on LiveBench, but it leaves room for improvement in mathematics and coding. The work provides a reproducible, data-centric baseline for quantifying the impact of data processing choices and points to future directions in leveraging knowledge-intensive data and curriculum-style optimization.

Abstract

The general capabilities of Large Language Models (LLMs) rely heavily on the composition and selection of extensive pretraining datasets, which several institutions treat as commercial secrets. To mitigate this issue, we open-source the details of a universally applicable data processing pipeline and validate its effectiveness and potential by introducing a competitive LLM baseline. Specifically, the data processing pipeline consists of broad collection to scale up and reweighting to improve quality. We then pretrain a 7B model, BaichuanSEED, on 3T tokens processed by our pipeline without any deliberate downstream task-related optimization, followed by a simple but effective supervised fine-tuning stage. BaichuanSEED demonstrates consistency and predictability throughout training and achieves performance on comprehensive benchmarks comparable to that of several advanced commercial large language models, such as Qwen1.5 and Llama3. We also conduct several heuristic experiments to discuss the potential for further optimization on downstream tasks, such as mathematics and coding.
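As a rough illustration of the two ideas named above (deduplication across the whole corpus, then reweighting), the following minimal Python sketch may help. It is not the authors' pipeline: the exact-match hashing, the `normalize`, `deduplicate`, and `reweight` helpers, the per-source weights, and the toy corpus are all assumptions made for this example only.

```python
import hashlib
import random
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return " ".join(text.lower().split())


def deduplicate(docs):
    """Keep the first occurrence of each normalized text across the whole corpus.

    Assumption: exact-match hashing stands in for whatever deduplication
    strategy the paper actually uses.
    """
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc["text"]).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


def reweight(docs, source_weights, n_samples, seed=0):
    """Sample documents with replacement, in proportion to a per-source weight.

    Assumption: the weights are hypothetical and chosen only for illustration.
    """
    rng = random.Random(seed)
    weights = [source_weights.get(doc["source"], 1.0) for doc in docs]
    return rng.choices(docs, weights=weights, k=n_samples)


# Toy corpus (hypothetical): the first two web documents are near-identical.
corpus = [
    {"source": "web", "text": "Hello   world"},
    {"source": "web", "text": "hello world"},
    {"source": "book", "text": "A chapter on algebra."},
    {"source": "paper", "text": "We study scaling laws."},
]

unique_docs = deduplicate(corpus)
sample = reweight(unique_docs, {"web": 0.5, "book": 2.0, "paper": 2.0}, n_samples=1000)
source_counts = Counter(doc["source"] for doc in sample)
print(len(unique_docs), source_counts)
```

In this sketch, deduplication runs over the full corpus before any weighting, so duplicated content cannot inflate a source's effective share; the weights then shift the sampling mix toward sources assumed to be of higher quality.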

Paper Structure (21 sections, 6 figures, 8 tables)

This paper contains 21 sections, 6 figures, 8 tables.

Introduction
Model Architecture
Pre-training
Pre-training Data
Collection
Reweighting
Other Principles
Training Details
Supervised Fine-tuning
SFT Data
Training Details
Evaluation
Scaling Curves
Comprehensive Benchmarks
Discussion
...and 6 more sections

Figures (6)

Figure 1: The detailed data proportions of subjects, including STEM, mathematics, social science, and others, across web pages, books, and papers.
Figure 2: The training loss of 2B models under different deduplication strategies.
Figure 3: The performance scaling curve of BaichuanSEED with respect to the number of training tokens.
Figure 4: The curve fitted to the first half of BaichuanSEED's performance on MMLU, with respect to the number of training tokens.
Figure 5: The performance on MMLU, GSM8K, and MATH of checkpoints obtained by continued pre-training from a 1.44T-token BaichuanSEED checkpoint with different proportions of mathematics-related data.
...and 1 more figures
