SampleMix: A Sample-wise Pre-training Data Mixing Strategy by Coordinating Data Quality and Diversity

Xiangyu Xi, Deyang Kong, Jian Yang, Jiawei Yang, Zhengyu Chen, Wei Wang, Jingang Wang, Xunliang Cai, Shikun Zhang, Wei Ye

TL;DR

This work addresses the limitations of domain-wise pretraining data mixing by recognizing substantial cross-domain overlaps and suboptimal intra-domain sampling. It introduces SampleMix, a sample-wise, bottom-up data mixing strategy that jointly optimizes per-sample quality and diversity to assemble a training dataset under a token budget $T_{\mathrm{tgt}}$. Through a GPT-4o-based ordinal quality evaluator and a clustering-based diversity evaluator, SampleMix assigns weights to samples and constructs the final dataset via a floor-plus-probabilistic rounding scheme, achieving higher downstream task accuracy and lower perplexity with 1.4x–2.1x fewer training steps. The method demonstrates robustness to varying token budgets and scales effectively to larger models, offering a practical, efficient approach to pretraining data selection and curation.
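The floor-plus-probabilistic rounding step mentioned above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`expected_copies`, `stochastic_round`, `sample_counts`) are hypothetical, and the way weights are scaled to the token budget $T_{\mathrm{tgt}}$ (so that the expected total token count equals the budget) is an assumption made for illustration.

```python
import math
import random


def expected_copies(weights, token_counts, token_budget):
    """Scale per-sample weights into expected copy counts.

    Assumption (not from the paper): the scale is chosen so that
    sum(copies_i * tokens_i) equals token_budget in expectation.
    """
    weighted_tokens = sum(w * t for w, t in zip(weights, token_counts))
    if weighted_tokens <= 0:
        raise ValueError("weights must give a positive weighted token count")
    scale = token_budget / weighted_tokens
    return [w * scale for w in weights]


def stochastic_round(x, rng):
    """Return floor(x), plus 1 with probability equal to the fractional part."""
    base = math.floor(x)
    frac = x - base
    return base + (1 if rng.random() < frac else 0)


def sample_counts(weights, token_counts, token_budget, seed=0):
    """Draw an integer copy count for every sample."""
    rng = random.Random(seed)
    targets = expected_copies(weights, token_counts, token_budget)
    return [stochastic_round(x, rng) for x in targets]
```

Under this scheme a sample with an expected count of 2.3 is always included twice and receives a third copy 30% of the time, so the expected count is preserved while every realized count stays an integer.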

Abstract

Existing pretraining data mixing methods for large language models (LLMs) typically follow a domain-wise methodology, a top-down process that first determines domain weights and then performs uniform data sampling across each domain. However, these approaches neglect significant inter-domain overlaps and commonalities, failing to control the global diversity of the constructed training dataset. Further, uniform sampling within domains ignores fine-grained sample-specific features, potentially leading to suboptimal data distribution. To address these shortcomings, we propose a novel sample-wise data mixture approach based on a bottom-up paradigm. This method performs global cross-domain sampling by systematically evaluating the quality and diversity of each sample, thereby dynamically determining the optimal domain distribution. Comprehensive experiments across multiple downstream tasks and perplexity assessments demonstrate that SampleMix surpasses existing domain-based methods. Meanwhile, SampleMix requires 1.4x to 2.1x fewer training steps to achieve the baselines' performance, highlighting the substantial potential of SampleMix to optimize pre-training data.

Paper Structure

This paper contains 34 sections, 4 equations, 12 figures, 9 tables.

Figures (12)

  • Figure 1: We conduct data clustering analysis using the SlimPajama dataset. For each domain (row), each cell shows the percentage of its clusters that also include samples from other domains (column). E.g., 76.60% of ArXiv's clusters include Wikipedia samples (1st row, 6th column). The results reveal substantial overlap between domains.
  • Figure 2: (a) Traditional methods determine domain weights and construct the training dataset by uniformly sampling from each domain. (b) SampleMix employs a sample-wise mixing strategy: it evaluates sample quality and diversity, assigns appropriate weights, and constructs an optimal dataset based on these weights. Dots of the same color represent data from the same domain.
  • Figure 3: Analysis of SlimPajama dataset. Mean values are marked with a dashed line.
  • Figure 4: Training efficiency comparison. SampleMix reaches the average baseline accuracy at 100k training steps, 1.9 times faster than the averaged baselines.
  • Figure 5: Average performance of downstream tasks with different weighting factor $\alpha$.
  • ...and 7 more figures
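
The cross-domain overlap statistic described in the Figure 1 caption can be sketched in Python as follows. This is only an illustration of the stated definition (for each row domain, the percentage of its clusters that also contain samples from the column domain). The function name `cluster_overlap` and the input format, a list of `(domain, cluster_id)` pairs, are assumptions, and the clustering step itself is taken as given.

```python
from collections import defaultdict


def cluster_overlap(samples):
    """Compute the pairwise domain overlap percentages.

    samples: iterable of (domain, cluster_id) pairs, one per document
             (hypothetical input format).
    Returns overlap[row][col] = percentage of row's clusters that also
    contain at least one sample from col.
    """
    # Which domains appear in each cluster.
    domains_in_cluster = defaultdict(set)
    for domain, cluster_id in samples:
        domains_in_cluster[cluster_id].add(domain)

    # Which clusters each domain appears in.
    clusters_of = defaultdict(set)
    for cluster_id, domains in domains_in_cluster.items():
        for domain in domains:
            clusters_of[domain].add(cluster_id)

    overlap = {}
    for row, row_clusters in clusters_of.items():
        overlap[row] = {}
        for col in clusters_of:
            if col == row:
                continue
            shared = sum(1 for c in row_clusters if col in domains_in_cluster[c])
            overlap[row][col] = 100.0 * shared / len(row_clusters)
    return overlap
```

The matrix is not symmetric: a small domain can share most of its clusters with a large one while the reverse percentage stays low, which is why the caption reads each cell by row.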