Shift is Good: Mismatched Data Mixing Improves Test Performance

Marko Medvedev, Kaifeng Lyu, Zhiyuan Li, Nathan Srebro

TL;DR

This work reveals that training on a mismatched mixture of components can improve test performance on a target mixture, a phenomenon termed positive distribution shift (PDS). By formalizing a K-component mixture framework with test proportions p and train proportions q, the authors derive how optimal training mixing q^* and the associated gains in sample efficiency depend on per-component learning curves, including power-law and memorization scenarios, and extend the analysis to compositional reasoning and transfer learning. Across power-law tasks, memorization, and skill-composition experiments, mismatching training and test distributions yields a favorable prefactor in the loss decay with sample size N, often with no change in the asymptotic decay rate. A key theoretical takeaway is that, except for measure-zero edge cases, the optimal training distribution differs from the test distribution, implying that carefully chosen data-mixing proportions can meaningfully reduce data requirements. The practical significance lies in guiding data-collection and pretraining strategies for multi-task or compositional settings, including large language models, by highlighting when and how to shift training distributions to improve test performance on targeted mixtures.

Abstract

We consider training and testing on mixture distributions with different training and test proportions. We show that in many settings, and in some sense generically, distribution shift can be beneficial, and test performance can improve due to mismatched training proportions, even if the components are unrelated and with no transfer between components. In a variety of scenarios, we identify the optimal training proportions and the extent to which such distribution shift can be beneficial. We show how the same analysis applies also to a compositional setting with differing distribution of component "skills" at training and test.

Paper Structure

This paper contains 37 sections, 19 theorems, 88 equations, 3 figures.

Key Result

Theorem 3.1

In model:generalpowerlaw, if for the exponents it holds that $\alpha_1 = \alpha_2 = \dots = \alpha_S < \alpha_{S+1} \le \alpha_{S+2} \le \dots \le \alpha_{K}$ for some $S$, then there exist $\varepsilon_1,\varepsilon_2>0$ such that for any test data mixing ratio ${\bm{p}}$ and any $N>N_0(\{A_i,B_i,\

Figures (3)

  • Figure 1: Error rate for a hypothetical scenario modelling the high-stakes exam described in the introduction. We model the error rate on each portion of the exam as proportional to $\frac{1}{n_i^{\alpha}}$, where $n_i$ is the studying budget spent on that portion: $i=1$ corresponds to European History and $i=2$ to Chinese History, and $n_1+n_2=N$ is the total studying budget, with $N=100$ hours. The exponent is $\alpha=1$ on the left plot and $\alpha=2$ on the right plot. In both cases, we consider $n_1=qN$ and $n_2=(1-q)N$, where $q$ is the proportion of time spent studying for the European History portion of the exam. The error rate on the exam can then be written as a function of $q$ as $L(q) = 0.9\frac{1}{(100q)^{\alpha}}+0.1\frac{1}{(100(1-q))^{\alpha}}$. Both plots show that shifting away from the test proportion (red line, i.e., $q=90\%$) toward the optimal training proportion (green line, i.e., $q^*$, whose values are displayed accordingly) leads to a better error rate. See also Corollary 1.
  • Figure 2: We consider the setup of Corollary 1 with $A=1,~\alpha=0.28$, $K=100$, and some fixed $N$. On the left plot, we show the "non-shifted" expected population loss $L^{\mathrm{same}}({\bm{p}})$ and the optimally mixed expected population loss $L^*({\bm{p}})$ as a function of the majority population mass $p$. On the right plot, we show the ratio of sample complexities for any fixed $\epsilon>0$, $N^{\textrm{ratio}}_{\epsilon}({\bm{p}})$, as a function of the majority population mass $p$. The positive distribution shift induced by the optimal mixing ratio yields a significant improvement in sample complexity, up to $\approx 25\%$.
  • Figure 3: A mismatched training distribution improves the test accuracy of a language model on a synthetic skill composition task (see the section on composition). At test time, the model is asked to compose several functions, sampled following a power law. Compared with training directly on this task (blue curve), mixing in another task that samples the functions uniformly improves the final accuracy (orange curve). Curves are averaged over $5$ random seeds.
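
The two-component exam model of Figure 1 can be checked numerically. The sketch below is an illustration, not the authors' code: the function names (`exam_loss`, `optimal_q`, `grid_search_q`) are hypothetical, and the loss is assumed to have the form $L(q) = p\,(Nq)^{-\alpha} + (1-p)\,(N(1-q))^{-\alpha}$ with $p=0.9$ and $N=100$. Setting $L'(q)=0$ under this assumption gives $q^*/(1-q^*) = (p/(1-p))^{1/(1+\alpha)}$, which the code compares against a brute-force grid search.

```python
def exam_loss(q, p=0.9, alpha=1.0, total_hours=100.0):
    """Assumed exam error rate when a fraction q of the budget goes to portion 1."""
    n1 = q * total_hours          # hours spent on portion 1 (test weight p)
    n2 = (1.0 - q) * total_hours  # hours spent on portion 2 (test weight 1 - p)
    return p / n1**alpha + (1.0 - p) / n2**alpha


def optimal_q(p=0.9, alpha=1.0):
    """Closed-form minimizer of exam_loss: q* is proportional to p**(1/(1+alpha))."""
    w1 = p ** (1.0 / (1.0 + alpha))
    w2 = (1.0 - p) ** (1.0 / (1.0 + alpha))
    return w1 / (w1 + w2)


def grid_search_q(p=0.9, alpha=1.0, steps=9999):
    """Brute-force minimizer over an evenly spaced grid strictly inside (0, 1)."""
    grid = [(i + 1) / (steps + 1) for i in range(steps)]
    return min(grid, key=lambda q: exam_loss(q, p, alpha))


if __name__ == "__main__":
    for alpha in (1.0, 2.0):
        q_star = optimal_q(alpha=alpha)
        print(f"alpha={alpha}: q* = {q_star:.4f}, "
              f"grid search = {grid_search_q(alpha=alpha):.4f}, "
              f"L(q*) = {exam_loss(q_star, alpha=alpha):.6f}, "
              f"L(0.9) = {exam_loss(0.9, alpha=alpha):.6f}")
```

For $\alpha=1$ the closed form gives $q^* = 3/4$, since $(0.9/0.1)^{1/2} = 3$; the matched choice $q=0.9$ yields a strictly larger loss under this model, which is the positive distribution shift the figure illustrates.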

Theorems & Definitions (37)

  • Theorem 3.1: Optimal Data Mixing Ratios For Power Law
  • Corollary 1: Sample Complexity Improvement From Optimal Data Mixing For General Power Law
  • Theorem 4.1: Optimal Data Mixing Test Error Improvement For Memorization Task
  • Corollary 2: Test Error Improvement For Memorization Task with Power Law Test Mixing Ratios
  • Theorem 7.1: Positive Distribution Shift Almost Always Exists For Data Mixing
  • Lemma 1: Independent Tasks
  • Corollary 3: Positive Distribution Shift Always Exists
  • Definition 1: Approximate Subpopulation Error Function
  • Proposition 1: Sufficient to Consider Expectation
  • Proof: Proof of \ref{lemma:app:sufficientfk}
  • ...and 27 more