Shift is Good: Mismatched Data Mixing Improves Test Performance

Marko Medvedev; Kaifeng Lyu; Zhiyuan Li; Nathan Srebro

TL;DR

This work reveals that training on a mismatched mixture of components can improve test performance on a target mixture, a phenomenon termed positive distribution shift (PDS). By formalizing a K-component mixture framework with test proportions p and train proportions q, the authors derive how optimal training mixing q^* and the associated gains in sample efficiency depend on per-component learning curves, including power-law and memorization scenarios, and extend the analysis to compositional reasoning and transfer learning. Across power-law tasks, memorization, and skill-composition experiments, mismatching training and test distributions yields a favorable prefactor in the loss decay with sample size N, often with no change in the asymptotic decay rate. A key theoretical takeaway is that, except for measure-zero edge cases, the optimal training distribution differs from the test distribution, implying that carefully chosen data-mixing proportions can meaningfully reduce data requirements. The practical significance lies in guiding data-collection and pretraining strategies for multi-task or compositional settings, including large language models, by highlighting when and how to shift training distributions to improve test performance on targeted mixtures.
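To make the power-law case concrete, here is a minimal sketch under simplifying assumptions that are not taken from the paper: each component k is learned independently, receives exactly q_k * N training samples, and has per-component test error c_k * n^(-alpha) with a shared exponent alpha. Under these assumptions the test loss is sum_k p_k * c_k * (q_k * N)^(-alpha), and a Lagrange-multiplier argument gives q_k proportional to (p_k * c_k)^(1 / (1 + alpha)). The function names and the numbers below are illustrative only.

```python
def mixture_test_loss(p, q, c, alpha, n_total):
    """Test loss under the assumed power-law curves:
    sum_k p_k * c_k * (q_k * N) ** (-alpha)."""
    return sum(pk * ck * (qk * n_total) ** (-alpha)
               for pk, qk, ck in zip(p, q, c))


def optimal_train_proportions(p, c, alpha):
    """Minimizer of mixture_test_loss over the simplex (assumed model):
    q_k proportional to (p_k * c_k) ** (1 / (1 + alpha))."""
    weights = [(pk * ck) ** (1.0 / (1.0 + alpha)) for pk, ck in zip(p, c)]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical two-component example: a common and a rare component.
p = [0.9, 0.1]       # test proportions
c = [1.0, 1.0]       # power-law prefactors (assumed equal)
alpha = 0.5          # shared power-law exponent (assumed)
n_total = 10_000     # total training samples N

q_star = optimal_train_proportions(p, c, alpha)
loss_matched = mixture_test_loss(p, p, c, alpha, n_total)       # train on q = p
loss_shifted = mixture_test_loss(p, q_star, c, alpha, n_total)  # train on q = q*

print("q* =", q_star)
print("matched loss =", loss_matched)
print("shifted loss =", loss_shifted)
```

In this toy model the optimal mixture oversamples the rare component relative to its test proportion, and the ratio of the two losses does not depend on N. That matches the summary's claim that the gain appears in the prefactor and not in the decay rate.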

Abstract

We consider training and testing on mixture distributions with different training and test proportions. We show that in many settings, and in some sense generically, distribution shift can be beneficial, and test performance can improve due to mismatched training proportions, even if the components are unrelated and with no transfer between components. In a variety of scenarios, we identify the optimal training proportions and the extent to which such distribution shift can be beneficial. We show how the same analysis also applies to a compositional setting with differing distributions of component "skills" at training and test.
