AutoMixer: Checkpoint Artifacts as Automatic Data Mixers

Ernie Chang, Yang Li, Patrick Huber, Vish Vogeti, David Kant, Yangyang Shi, Vikas Chandra

TL;DR

AutoMixer tackles the data-curation bottleneck in large-scale language model pretraining by using checkpoint artifacts as automatic data mixers. It regroups raw training data into task-aligned groups based on multi-checkpoint influence signals and assigns sampling weights via joint influence densities, enabling dynamic, task-aware data loading. The framework employs efficient influence approximations and discriminative layer selection to scale influence estimation, and uses proxy-model simulations to identify optimal data mixtures. Across eight reasoning benchmarks and multiple model scales, AutoMixer (especially with a 350M proxy) achieves notable gains over uniform sampling, supporting the value of checkpoint-guided data curation for targeted skill acquisition.

Abstract

In language model training, it is desirable to equip models with capabilities from various tasks. However, it is not clear how to directly obtain the right data mixtures for these capabilities, as the relationship between data and tasks is difficult to model. In this work, we observe that checkpoint models exhibit emerging capabilities at different points in the training trajectory. Often, the training process saves checkpoints as artifacts that are under-utilized as a source of in-training data signals. We identify these artifact models based on their respective capabilities on the benchmarks and leverage them as data mixers by using their aggregated first-order influence approximation over source data. We demonstrate on eight reasoning benchmarks that the proposed framework yields significant improvements in the pretraining setting, with performance improvements of up to 1.93%. Overall, this shows the potential of checkpoint models to enhance data quality and optimize data mixtures.
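
The abstract does not spell out the influence formula, so the following is only a minimal sketch of one common first-order form: the influence of a training sample on a task sample is taken as the dot product of their loss gradients at a checkpoint, and the per-checkpoint values are summed. The linear model, the squared-error loss, the sum as the aggregation rule, and all function names are assumptions made for illustration, not details taken from the paper.

```python
from typing import List, Sequence, Tuple

Example = Tuple[Sequence[float], float]  # (features, target); hypothetical toy format


def grad_squared_error(w: Sequence[float], x: Sequence[float], y: float) -> List[float]:
    """Gradient of 0.5 * (w.x - y)^2 with respect to w for a toy linear model."""
    residual = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [residual * xi for xi in x]


def first_order_influence(w: Sequence[float], train: Example, task: Example) -> float:
    """Dot product of the training-sample gradient and the task-sample gradient."""
    g_train = grad_squared_error(w, train[0], train[1])
    g_task = grad_squared_error(w, task[0], task[1])
    return sum(a * b for a, b in zip(g_train, g_task))


def aggregated_influence(checkpoints: Sequence[Sequence[float]],
                         train: Example, task: Example) -> float:
    """Sum the first-order influence over checkpoint weight vectors (assumed aggregation)."""
    return sum(first_order_influence(w, train, task) for w in checkpoints)


# Two toy "checkpoints" of a 2-parameter linear model.
checkpoints = [[0.0, 0.0], [0.5, 0.5]]
task_example = ([1.0, 0.0], 1.0)

aligned_train = ([1.0, 0.0], 1.0)     # shares the task example's feature direction
orthogonal_train = ([0.0, 1.0], 1.0)  # shares no feature with the task example

aligned_score = aggregated_influence(checkpoints, aligned_train, task_example)
orthogonal_score = aggregated_influence(checkpoints, orthogonal_train, task_example)
print(aligned_score, orthogonal_score)
```

In this toy setting the training sample that shares the task's feature direction receives a positive aggregated score, while the orthogonal sample receives zero, which is the kind of signal a data mixer could rank on.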

Paper Structure

This paper contains 22 sections, 12 equations, 6 figures, 4 tables, 1 algorithm.

Figures (6)

  • Figure 1: Illustration of the checkpoint selection process and subsequent sampling. We leverage intermediate model checkpoints to group and sample data for targeted skill acquisition.
  • Figure 2: Illustration of the AutoMixer framework: Each data sample from ungrouped raw pretraining sources is assigned an influence score. These scores guide the regrouping of incoming data into task-specific datasets by (1) splitting raw data into groups based on task checkpoints, and (2) determining sampling weights by aggregating influence scores across checkpoints.
  • Figure 3: Depiction of the Data Regrouping Process: Within each group, samples (with indices) are sorted based on their joint influence scores across all tasks. This sorting results in different sample orderings between groups. The final step involves selecting the top $K\%$ of samples from each group to form a data group that fulfills the token budget. Although some samples appear in more than one group, we found that repeated samples contain high-value tokens that are beneficial for repeated exposure during training.
  • Figure 4: A performance comparison of two approaches (uniform sampling and AutoMixer-350M) across ten evenly spaced training steps (0 to 100k). Both exhibit minor fluctuations yet follow an overall upward trend in accuracy. AutoMixer-350M consistently outperforms uniform sampling throughout training, ultimately reaching 56.45% accuracy versus 51.82% for uniform sampling.
  • Figure 5: Mean text quality score by range: We show the sample quality score across all buckets (grouped by percentiles) of samples sorted by influence scores. The normalized mean token count (in range $[0,1]$) per sample in the same set of buckets is labeled on each point. The 75M proxy model tends to select longer sentences with higher influence scores.
  • ...and 1 more figures
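
The regrouping and weighting steps described in the captions for Figures 2 and 3 can be sketched roughly as follows. This is an illustrative reading, not the authors' implementation: the per-task score lists stand in for influence scores already aggregated across checkpoints, the weighting rule (each group's share of the retained positive influence mass) is an assumption, and the names `regroup_top_k` and `sampling_weights` are hypothetical.

```python
import math
from typing import Dict, List


def regroup_top_k(scores: Dict[str, List[float]], k_percent: float) -> Dict[str, List[int]]:
    """For each task group, keep the indices of the top K% samples by influence score."""
    groups: Dict[str, List[int]] = {}
    for task, task_scores in scores.items():
        n_keep = max(1, math.ceil(len(task_scores) * k_percent / 100.0))
        # Sort sample indices from highest to lowest score for this task.
        ranked = sorted(range(len(task_scores)), key=lambda i: task_scores[i], reverse=True)
        groups[task] = ranked[:n_keep]
    return groups


def sampling_weights(scores: Dict[str, List[float]],
                     groups: Dict[str, List[int]]) -> Dict[str, float]:
    """Weight each group by its share of retained positive influence (assumed rule)."""
    mass = {task: sum(max(scores[task][i], 0.0) for i in idx) for task, idx in groups.items()}
    total = sum(mass.values())
    if total == 0.0:
        # Fall back to uniform sampling when no group carries positive influence.
        return {task: 1.0 / len(mass) for task in mass}
    return {task: m / total for task, m in mass.items()}


# Toy scores: one list per task, one entry per raw sample (indices 0..3).
scores = {
    "task_a": [0.9, 0.1, 0.5, 0.3],
    "task_b": [0.2, 0.1, 0.6, 0.0],
}
groups = regroup_top_k(scores, k_percent=50)
weights = sampling_weights(scores, groups)
print(groups)
print(weights)
```

In this toy input, samples 0 and 2 land in both groups, mirroring the caption's note that the same sample may be selected for more than one group.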