The sample complexity of multi-distribution learning

Binghui Peng

TL;DR

The paper tackles multi-distribution learning: given $k$ distributions and a hypothesis class of VC dimension $d$, find a classifier whose worst-case population loss across the $k$ distributions is within $\epsilon$ of the optimal worst-case loss $\mathrm{OPT}$. It introduces a boosting framework based on multiplicative weight updates (MWU) and a novel recursive width reduction that cuts the number of MWU rounds, achieving a near-optimal sample complexity of $\widetilde{O}((d+k)\epsilon^{-2}) \cdot (k/\epsilon)^{o(1)}$, where $\widetilde{O}$ hides polylogarithmic factors. Central to the approach are width reduction, the construction of an $\epsilon$-cover, and soundness/completeness properties that preserve the optimal classifier while allowing aggressive truncation of losses. The method does not need to know $\mathrm{OPT}$ exactly: it runs over a grid of candidate values for $\mathrm{OPT}$ and refines them, which yields the final algorithm with the stated sample complexity. These results resolve the COLT 2023 open problem and show that, up to sub-polynomial factors, handling $k$ distributions adds only a $k\epsilon^{-2}$ term on top of the $d\epsilon^{-2}$ term familiar from single-distribution agnostic PAC learning. The techniques may also be useful for boosting methods in other agnostic, multi-distribution settings.
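To make the MWU idea concrete, here is a minimal sketch of the generic min-max MWU loop that frameworks of this kind build on. It is not the paper's algorithm: it omits width reduction, the $\epsilon$-cover, and all sampling, and it assumes a small finite hypothesis set whose per-distribution losses are already known exactly. The function name `mwu_minimax` and the `loss` matrix are hypothetical names introduced only for illustration.

```python
import math


def mwu_minimax(loss, rounds=2000, eta=None):
    """Generic MWU over distributions (illustrative sketch, not the paper's method).

    loss[h][i] is the ASSUMED-KNOWN loss in [0, 1] of hypothesis h on
    distribution i. Returns a mixture over hypotheses and its worst-case loss.
    """
    num_h, k = len(loss), len(loss[0])
    if eta is None:
        eta = math.sqrt(math.log(k) / rounds)  # standard step size for T rounds

    p = [1.0 / k] * k        # current weights over the k distributions
    counts = [0] * num_h     # how often each hypothesis was the best response

    for _ in range(rounds):
        # Best response: hypothesis with the smallest p-weighted loss.
        best = min(
            range(num_h),
            key=lambda h: sum(p[i] * loss[h][i] for i in range(k)),
        )
        counts[best] += 1
        # Up-weight distributions on which the chosen hypothesis does badly.
        unnormalized = [p[i] * math.exp(eta * loss[best][i]) for i in range(k)]
        total = sum(unnormalized)
        p = [u / total for u in unnormalized]

    # Uniform mixture over the per-round best responses.
    mixture = [c / rounds for c in counts]
    worst = max(
        sum(mixture[h] * loss[h][i] for h in range(num_h)) for i in range(k)
    )
    return mixture, worst


# Toy instance: two "specialist" hypotheses and one mediocre generalist.
toy_loss = [[0.0, 1.0], [1.0, 0.0], [0.6, 0.6]]
toy_mixture, toy_worst = mwu_minimax(toy_loss)
```

On this toy instance no mixture can have worst-case loss below 0.5, and the standard MWU regret bound keeps `toy_worst` within about 0.03 of that value after 2000 rounds. The number of rounds needed grows with the range ("width") of the losses, which is the quantity the paper's width reduction targets.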

Abstract

Multi-distribution learning generalizes classic PAC learning to handle data coming from multiple distributions. Given a set of $k$ data distributions and a hypothesis class of VC dimension $d$, the goal is to learn a hypothesis that minimizes the maximum population loss over the $k$ distributions, up to $\epsilon$ additive error. In this paper, we settle the sample complexity of multi-distribution learning by giving an algorithm of sample complexity $\widetilde{O}((d+k)\epsilon^{-2}) \cdot (k/\epsilon)^{o(1)}$. This matches the lower bound up to a sub-polynomial factor and resolves the COLT 2023 open problem of Awasthi, Haghtalab and Zhao [AHZ23].
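The goal described above can be written as a single inequality. The notation here is an assumption made for illustration and may differ from the paper's Definition 2.1: $D_1, \dots, D_k$ are the distributions, $\mathcal{H}$ is the hypothesis class, $\ell_{D_i}(h)$ is the population loss of $h$ on $D_i$, and $\hat{h}$ is the learner's output.

$$
\max_{i \in [k]} \ell_{D_i}(\hat{h}) \;\le\; \mathrm{OPT} + \epsilon,
\qquad \text{where} \quad
\mathrm{OPT} = \min_{h \in \mathcal{H}} \max_{i \in [k]} \ell_{D_i}(h).
$$

With $k = 1$ this reduces to the usual agnostic PAC guarantee for a single distribution.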

Paper Structure

This paper contains 12 sections, 11 theorems, 43 equations, and 5 algorithms.

Introduction
Technical overview: Achieving optimal sample complexity via recursive width reduction
The MWU framework
Width reduction
Recursive width reduction
Remove prior knowledge on $\mathrm{OPT}$
Related work
Concurrent and independent work
Preliminary
The boosting framework
Analysis
Final algorithm

Key Result

Theorem 1.1

Let $k$ be the number of distributions and $d$ the VC dimension of the hypothesis class. For any $\epsilon > 0$, there is an algorithm that outputs an $\epsilon$-optimal classifier with probability $1-\delta$ and has sample complexity matching the bound stated in the abstract, $\widetilde{O}((d+k)\epsilon^{-2}) \cdot (k/\epsilon)^{o(1)}$.

Theorems & Definitions (19)

Theorem 1.1: Multi-distribution learning
Definition 2.1: Multi-distribution learning
Lemma 2.2: Sauer–Shelah Lemma [sauer1972density, shelah1972combinatorial]
Lemma 2.3: Regret guarantee of MWU [arora2012multiplicative]
Lemma 3.1: Boosting framework
Lemma 3.2: Guarantee of $\textsc{ConstructCover}$, adapted from Lemma 3.3 of [alon2019limits]
Lemma 3.3: Guarantee of $\textsc{Filter}$, Part 1
proof
Lemma 3.4: Guarantee of $\textsc{Filter}$, Part 2
proof
...and 9 more
