Adaptive Sample Aggregation In Transfer Learning

Steve Hanneke; Samory Kpotufe

TL;DR

The paper develops a unified theory of adaptive transfer learning across a broad family of source–target divergences by introducing the weak modulus $\delta(\epsilon)$ and its refinement, the strong modulus $\delta(\epsilon_1,\epsilon_2)$. It shows that adaptive procedures based on weak confidence sets can automatically adjust to unknown divergences, yielding rates that interpolate between source- and target-only performance, and proves corresponding minimax lower bounds. The strong modulus captures regimes where aggregating source and target data yields strictly faster rates than either source or target alone, and the authors provide adaptive procedures that attain near-optimal rates in this regime as well. A complete gap characterization between weak and strong moduli is given, tied to monotonicity properties of excess risks, with convex settings showing no gap and certain feature-selection scenarios exhibiting gaps. The work thus offers principled design guidelines for transfer-learning algorithms that are robust to a range of distributional shifts and provides a foundation for further practical and nonparametric extensions.

Abstract

Transfer learning aims to optimally aggregate samples from a target distribution with related samples from a so-called source distribution in order to improve target risk. Multiple procedures have been proposed over the last two decades to address this problem, each driven by one of a multitude of possible divergence measures between source and target distributions. A first question asked in this work is whether there exist unified algorithmic approaches that automatically adapt to many of these divergence measures simultaneously. We show that this is indeed the case for a large family of divergences proposed across classification and regression tasks, as they all happen to upper-bound the same measure of continuity between source and target risks, which we refer to as a weak modulus of transfer. This more unified view allows us, first, to identify algorithmic approaches that are simultaneously adaptive to these various divergence measures via a reduction to particular confidence sets. Second, it allows for a more refined understanding of the statistical limits of transfer under such divergences and, in particular, reveals regimes with faster rates than might be expected under coarser lenses. We then turn to situations that are not well captured by the weak modulus and corresponding divergences: these are situations where the aggregate of source and target data can improve target performance significantly beyond what is possible with either source or target data alone. We show that common such situations (as may arise, e.g., under certain causal models with spurious correlations) are well described by a so-called strong modulus of transfer, which supersedes the weak modulus. We finally show that the strong modulus also admits adaptive procedures, which achieve near-optimal rates in terms of the unknown strong modulus and therefore apply in more general settings.

Paper Structure (44 sections, 39 theorems, 118 equations, 6 figures)

Introduction
Formal Overview.
Other Related Works.
Preliminaries
Basic Definitions.
Transfer Setting.
Weak Modulus of Transfer
Some Existing Discrepancies vs Weak Modulus
Some Discrepancies in Classification.
Some Discrepancies in Regression.
Adaptive Transfer Upper-Bounds
Examples of Weak Confidence Sets
Classification with $0$-$1$ loss.
Regression with Squared Loss.
Lower-Bounds
...and 29 more sections

Key Result

Proposition 1

The weak modulus $\delta(\epsilon)$ is non-decreasing in $\epsilon$, i.e., for all $\epsilon \leq \epsilon'$ it holds that $\delta(\epsilon) \leq \delta(\epsilon')$.
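The monotonicity in Proposition 1 can be illustrated numerically. The following is only a minimal sketch, assuming a finite hypothesis class whose excess risks under $P$ and $Q$ are known exactly (in the paper these quantities are unknown and the class need not be finite). It uses the characterization given in the Figure 1 caption: $\delta(\epsilon)$ is the smallest $\epsilon_Q$ such that $\mathcal{H}_Q(\epsilon_Q)$ contains $\mathcal{H}_P(\epsilon)$, which for a finite class is the largest $Q$-excess risk among hypotheses with $P$-excess risk at most $\epsilon$. The names `weak_modulus`, `excess_p`, and `excess_q`, and the numbers used, are hypothetical and chosen for illustration.

```python
def weak_modulus(excess_p, excess_q, eps):
    """Largest Q-excess risk over hypotheses with P-excess risk <= eps.

    Assumption (not from the paper): the class is finite and both excess-risk
    lists are known; index i refers to the same hypothesis in both lists.
    """
    retained = [eq for ep, eq in zip(excess_p, excess_q) if ep <= eps]
    # With a P-risk minimizer in the class (excess risk 0), the retained set
    # is non-empty for every eps >= 0; the default only guards eps < 0.
    return max(retained, default=0.0)


# Toy excess risks for four hypothetical hypotheses.
excess_p = [0.0, 0.05, 0.10, 0.30]
excess_q = [0.20, 0.0, 0.25, 0.10]

eps_grid = [0.0, 0.05, 0.10, 0.30]
values = [weak_modulus(excess_p, excess_q, e) for e in eps_grid]
print(values)  # [0.2, 0.2, 0.25, 0.25]
```

Enlarging $\epsilon$ can only add hypotheses to the retained set $\mathcal{H}_P(\epsilon)$, so the maximum over that set cannot decrease, which is the content of the proposition in this toy setting.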

Figures (6)

Figure 1: The analysis does not assume that risk minimizers exist; however, the reader may find this geometric illustration more intuitive. The ellipsoid centered at $h^{\!*}_P$ depicts $\mathcal{H}_P(\epsilon)$, while $\delta(\epsilon)$ is the smallest $\epsilon_Q$ such that $\mathcal{H}_Q(\epsilon_Q)$ (illustrated as a ball centered at $h^{\!*}_Q$) contains $\mathcal{H}_P(\epsilon)$.
Figure 2: Let $\mu$ denote $P$ or $Q$; an $\epsilon_\mu$-confidence set $\hat{\mathcal{H}}_\mu$ is illustrated here with a solid gray boundary; the ball of radius $\epsilon_\mu$ centered at $h^{\!*}_\mu$ represents $\mathcal{H}_\mu(\epsilon_\mu)$. Access to such sets $\hat{\mathcal{H}}_\mu$ is sufficient for adapting to the unknown modulus $\delta_{P,Q}(\cdot)$, implying adaptation to any discrepancy measure that $\delta_{P,Q}(\cdot)$ lower-bounds.
Figure 3: Strong Modulus. The worst-case $Q$-risk in the retained ellipsoid determines $\delta(\epsilon_1, \epsilon_2)$.
Figure 4: Suppose $\mathcal{H}$ consists of one-sided thresholds on the line, $Y = h^{\!*}_\mu(x)$ for $\mu \equiv P$ or $Q$. It's easy to see that $\mathcal{E}_Q(h)$ (mass of disagreement regions) goes up as $h$ approaches $h^{\!*}_P$ from the left, and goes down otherwise. The problem therefore has no gap between moduli.
Figure 5: Suppose $P$ is supported in the gray region so that $h^{\!*}_{P, 1}, h^{\!*}_{P, 2}$ are optimal under $P$ for classifying square vs circle. Notice that $h^{\!*}_P$ however relies on spurious features; a small amount of data from $Q$, some falling in the upper-left quadrant, allows rejecting $h^{\!*}_{P, 2}$.
...and 1 more figure

Theorems & Definitions (120)

Remark 1
Definition 1
Definition 2: $\epsilon$-Minimal Set
Definition 3: Weak Modulus
Definition 4: Upper Pivot
Proposition 1
Remark 2: Tightness
Remark 3: Tightness
Definition 5
Remark 4
...and 110 more
