Mixtera: A Data Plane for Foundation Model Training

Maximilian Böther, Xiaozhe Yao, Tolga Kerimoglu, Dan Graur, Viktor Gsteiger, Ana Klimovic

TL;DR

Mixtera addresses the challenge of managing and mixing training data for foundation models by introducing a centralized, declarative data plane that sits atop existing data collections. It enables dynamic, policy-driven mixtures through a chunk-based streaming mechanism and a metadata-centric architecture built on DuckDB, with a novel ChunkerIndex for efficient interval-based data retrieval. The approach supports dynamic adaptation via algorithms such as Adaptive Data Optimization (ADO), maintains determinism for reproducible training, and integrates with standard training frameworks. Empirical results show that Mixtera does not bottleneck training, scales to 256 GH200 superchips, and provides competitive throughput across data formats; dynamically adjusting the mixture with ADO improves pre-training performance on HellaSwag over the default static mixture. Overall, Mixtera offers a practical, open, and extensible foundation for experiment-driven data composition in large-scale AI training.

Abstract

State-of-the-art large language and vision models are trained over trillions of tokens that are aggregated from a large variety of sources. As training data collections grow, manually managing the samples becomes time-consuming, tedious, and prone to errors. Yet recent research shows that the data mixture and the order in which samples are visited during training can significantly influence model accuracy. We build and present Mixtera, a data plane for foundation model training that enables users to declaratively express which data samples should be used in which proportion and in which order during training. Mixtera is a centralized, read-only layer that is deployed on top of existing training data collections and can be declaratively queried. It operates independently of the filesystem structure and supports mixtures across arbitrary properties (e.g., language, source dataset) as well as dynamic adjustment of the mixture based on model feedback. We experimentally evaluate Mixtera and show that our implementation does not bottleneck training and scales to 256 GH200 superchips. We demonstrate how Mixtera supports recent advancements in mixing strategies by implementing the proposed Adaptive Data Optimization (ADO) algorithm in the system and evaluating its performance impact. We also explore the role of mixtures for vision-language models.
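To make the idea of using samples "in which proportion" concrete, the following is a minimal sketch of how a fixed-size chunk could be split across domains according to a declared mixture. It is not Mixtera's API: the function name `chunk_counts`, the `mixture` dictionary, and the `chunk_size` parameter are hypothetical names chosen for illustration, and the largest-remainder rounding is an assumption rather than something the paper states.

```python
from math import floor


def chunk_counts(mixture: dict[str, float], chunk_size: int) -> dict[str, int]:
    """Split chunk_size samples across domains in proportion to mixture weights.

    Hypothetical helper for illustration only (not part of Mixtera).
    Uses largest-remainder rounding so the counts always sum to chunk_size.
    """
    total = sum(mixture.values())
    if total <= 0:
        raise ValueError("mixture weights must sum to a positive value")

    # Ideal (fractional) number of samples per domain.
    ideal = {domain: chunk_size * weight / total for domain, weight in mixture.items()}
    counts = {domain: floor(value) for domain, value in ideal.items()}

    # Hand out the samples lost to rounding, largest fractional part first.
    leftover = chunk_size - sum(counts.values())
    by_remainder = sorted(ideal, key=lambda d: ideal[d] - counts[d], reverse=True)
    for domain in by_remainder[:leftover]:
        counts[domain] += 1
    return counts


if __name__ == "__main__":
    # Weights are relative; they do not need to sum to 1.
    print(chunk_counts({"en": 5, "de": 3, "code": 2}, chunk_size=10))
```

Under this sketch, a dynamic mixing policy would simply call the helper with updated weights before composing the next chunk, while a static mixture would reuse the same weights throughout training.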

Paper Structure

This paper contains 21 sections, 8 figures, 4 tables, 1 algorithm.

Figures (8)

  • Figure 1: Dynamically adjusting the mixture using the ADO algorithm improves pre-training performance on HellaSwag over the default static mixture across model scales.
  • Figure 2: Mixtera needs less offline processing and scripting.
  • Figure 3: Mixtera system architecture.
  • Figure 4: An example query using Mixtera.
  • Figure 5: Performance of the 1B model on HellaSwag, OpenBookQA, and ARC-Easy, measured every 2,500 steps.
  • ...and 3 more figures