MixMin: Finding Data Mixtures via Convex Minimization

Anvith Thudi; Evianne Rovers; Yangjun Ruan; Tristan Thrush; Chris J. Maddison

TL;DR

MixMin frames the choice of data-source mixture weights as an optimization problem that becomes convex as the model class grows more expressive, which makes it possible to find good source weights by gradient-based optimization using cheap proxy models. For cross-entropy (CE) and mean-squared-error (MSE) losses with no covariate shift, the Bayes-optimal model for a mixture is a linear combination of the per-source Bayes-optimal models, which gives a simple empirical objective over a target dataset. The method trains one proxy model per source, optimizes this objective over the mixture weights via entropic descent on the simplex, then remixes the data according to the learned weights. Across language modeling and chemistry tasks, MixMin yields consistent improvements with only about 1% of full training compute spent on proxies, and the benefits transfer to larger models and larger pools of sources, pointing to a scalable approach to data curation. Limitations include the reliance on the no-covariate-shift assumption and on CE/MSE losses, which suggests extending the framework to broader settings.
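The weight-fitting step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes each proxy model has already been evaluated on the target set, so that `probs[k, i]` holds the probability that the proxy for source `k` assigns to the true label of target example `i`. The function names, step size, step count, and toy numbers are all hypothetical. The update is the standard multiplicative-weights form of entropic (mirror) descent applied to the CE objective of the mixed predictor.

```python
import numpy as np


def mixture_nll(weights, probs):
    """Mean negative log-likelihood of the weighted mixture of proxy predictions."""
    mix = weights @ probs  # shape (n_examples,)
    return -np.mean(np.log(mix))


def entropic_descent(probs, steps=200, lr=0.5):
    """Minimize mixture_nll over the probability simplex.

    probs: array of shape (n_sources, n_examples); hypothetical input holding
           each proxy's probability of the true target label.
    """
    n_sources = probs.shape[0]
    w = np.full(n_sources, 1.0 / n_sources)  # start from the uniform mixture
    for _ in range(steps):
        mix = w @ probs
        grad = -(probs / mix).mean(axis=1)  # gradient of the mean NLL w.r.t. w
        w = w * np.exp(-lr * grad)          # multiplicative (entropic) update
        w /= w.sum()                        # renormalize onto the simplex
    return w


# Toy data (made up): source 0 predicts every target example best.
probs = np.array([
    [0.90, 0.80, 0.90, 0.85],
    [0.50, 0.50, 0.40, 0.60],
    [0.10, 0.20, 0.10, 0.15],
])
uniform = np.full(3, 1.0 / 3.0)
weights = entropic_descent(probs)
print(weights, mixture_nll(uniform, probs), mixture_nll(weights, probs))
```

Because the objective is convex in the weights and the update keeps them positive and normalized, no projection step is needed. In this toy case the weights concentrate on source 0; with real proxies the learned weights would then be used to resample the training data.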

Abstract

Modern machine learning pipelines increasingly combine and mix data from diverse and disparate sources, e.g., when pre-training large language models. Yet finding the optimal data mixture is a challenging and open problem. We formalize this data mixing problem as a bi-level objective: the best mixture is the one that would lead to the best model for a downstream objective. Unfortunately, this objective is generally intractable. In this paper, we make the observation that the bi-level data mixing objective becomes convex as our model class becomes larger. We develop and study a gradient-based approach for optimizing this convex objective, which we call MixMin, and test it on language modeling and chemistry tasks. MixMin was the only method that uniformly improved the data mixture in all our experiments. With MixMin, we improved the data mixture using less than 0.2% additional compute for a pythia-410M model trained on 8.2B tokens, yielding a 1-5% relative improvement in negative log likelihood on PIQA, ARC Easy, SciQ, and OpenWebMath. Crucially, we found that MixMin mixtures for smaller models improved training of larger models, suggesting that MixMin mixtures may be scale-invariant. When mixing bioassay data to train an XGBoost model, we saw improvements in average precision scores of 0.03-0.15.
