Is Multi-Distribution Learning as Easy as PAC Learning: Sharp Rates with Bounded Label Noise

Rafael Hanashiro; Abhishek Shetty; Patrick Jaillet

TL;DR

We show that learning across $k$ distributions inherently incurs slow rates scaling with $k/\epsilon^2$, even under constant noise levels, unless each distribution is learned separately.

Abstract

Towards understanding the statistical complexity of learning from heterogeneous sources, we study the problem of multi-distribution learning. Given $k$ data sources, the goal is to output a classifier for each source by exploiting shared structure to reduce sample complexity. We focus on the bounded label noise setting to determine whether the fast $1/\epsilon$ rates achievable in single-task learning extend to this regime with minimal dependence on $k$. Surprisingly, we show that this is not the case. We demonstrate that learning across $k$ distributions inherently incurs slow rates scaling with $k/\epsilon^2$, even under constant noise levels, unless each distribution is learned separately. A key technical contribution is a structured hypothesis-testing framework that captures the statistical cost of certifying near-optimality under bounded noise, a cost we show is unavoidable in the multi-distribution setting. Finally, we prove that when competing with the stronger benchmark of each distribution's optimal Bayes error, the sample complexity incurs a multiplicative penalty in $k$. This establishes a statistical separation between random classification noise and Massart noise, highlighting a fundamental barrier unique to learning from multiple sources.
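To make the gap between the two rates in the abstract concrete, the following minimal sketch compares only the scaling terms $1/\epsilon$ and $k/\epsilon^2$. It is an illustration under strong assumptions: constants, logarithmic factors, and any dependence on the hypothesis class or noise level are dropped, so the numbers are not the paper's sample complexity bounds. The function names are hypothetical and chosen for this example.

```python
def fast_rate_term(eps: float) -> float:
    """Scaling term 1/eps of the fast single-task rate (constants dropped)."""
    return 1.0 / eps


def slow_rate_term(k: int, eps: float) -> float:
    """Scaling term k/eps^2 of the slow multi-distribution rate (constants dropped)."""
    return k / eps ** 2


def gap(k: int, eps: float) -> float:
    """Ratio of the slow term to the fast term, which simplifies to k/eps."""
    return slow_rate_term(k, eps) / fast_rate_term(eps)


if __name__ == "__main__":
    # Tabulate both terms for a few illustrative (assumed) values of k and eps.
    for k in (1, 10, 100):
        for eps in (0.1, 0.01):
            print(
                f"k={k:>3} eps={eps:<5} "
                f"1/eps={fast_rate_term(eps):.0f} "
                f"k/eps^2={slow_rate_term(k, eps):.0f} "
                f"ratio={gap(k, eps):.0f}"
            )
```

The ratio of the two terms is $k/\epsilon$, so the gap widens both as the number of sources grows and as the target accuracy tightens.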

Paper Structure (57 sections, 19 theorems, 142 equations, 6 algorithms)

Introduction
Notation.
Problem Setup
RCN.
Minimax.
Massart.
Overview and Contributions
Upper Bounds (\ref{sec:ub}).
Structured Hypothesis Testing (\ref{sec:SHT}).
MDL Lower Bounds (\ref{sec:mdl-lb}).
Separation Between RCN and Massart (\ref{sec:mdl-mass}).
Related Work
Multi-Distribution Learning.
Related Learning Settings.
PAC Learning.
...and 42 more sections

Key Result

Lemma 3.1

Let $\mathcal{U}\subset\cbr{P_1,\dots,P_k}$ and let $\bar{P}_\mathcal{U} = \frac{1}{\abs{\mathcal{U}}} \sum_{i\in\mathcal{U}} P_i$ be their uniform mixture. Then, ERM $\hat{f} = \mathop{\mathrm{ERM}}\nolimits_\mathcal{F}\del{S}$ on a sample $S\overset{iid}{\sim} \bar{P}_\mathcal{U}$ of size $\abs{S}

Theorems & Definitions (29)

Lemma 3.1
Lemma 3.2
Theorem 3.3: \ref{eq:MDL-RCN} upper bound
Theorem 3.4: \ref{eq:MDL-MM} upper bound
Remark: Condition for separate learning
Lemma 4.1: Testing via empirical errors
Lemma 4.2: From learning to testing
Theorem 4.3: \ref{eq:SHT} upper bound
Theorem 4.4: \ref{eq:SHT} lower bound
Proof: Proof sketch of \ref{thm:SHT-lb}
...and 19 more
