Adaptive Sample Aggregation In Transfer Learning
Steve Hanneke, Samory Kpotufe
TL;DR
The paper develops a unified theory of adaptive transfer learning across a broad family of source–target divergences by introducing the weak modulus $\delta(\epsilon)$ and its refinement, the strong modulus $\delta(\epsilon_1,\epsilon_2)$. It shows that adaptive procedures based on weak confidence sets can automatically adjust to unknown divergences, yielding rates that interpolate between source-only and target-only performance, and proves corresponding minimax lower bounds. The strong modulus captures regimes where aggregating source and target data yields strictly faster rates than either source or target data alone, and the authors provide adaptive procedures that attain near-optimal rates in this regime as well. The paper also fully characterizes the gap between the weak and strong moduli, tying it to monotonicity properties of excess risks: convex settings show no gap, while certain feature-selection scenarios do. The work thus offers principled design guidelines for transfer-learning algorithms that are robust to a range of distributional shifts and provides a foundation for further practical and nonparametric extensions.
Abstract
Transfer learning aims to optimally aggregate samples from a target distribution with related samples from a so-called source distribution in order to improve target risk. Multiple procedures have been proposed over the last two decades to address this problem, each driven by one of a multitude of possible divergence measures between source and target distributions. A first question asked in this work is whether there exist unified algorithmic approaches that automatically adapt to many of these divergence measures simultaneously. We show that this is indeed the case for a large family of divergences proposed across classification and regression tasks, as they all happen to upper-bound the same measure of continuity between source and target risks, which we refer to as a weak modulus of transfer. This more unified view allows us, first, to identify algorithmic approaches that are simultaneously adaptive to these various divergence measures via a reduction to particular confidence sets. Second, it allows for a more refined understanding of the statistical limits of transfer under such divergences, and in particular, reveals regimes with faster rates than might be expected under coarser lenses. We then turn to situations that are not well captured by the weak modulus and corresponding divergences: these are situations where the aggregate of source and target data can improve target performance significantly beyond what is possible with either source or target data alone. We show that common situations of this kind (as may arise, e.g., under certain causal models with spurious correlations) are well described by a so-called strong modulus of transfer, which supersedes the weak modulus. We finally show that the strong modulus also admits adaptive procedures, which achieve near-optimal rates in terms of the unknown strong modulus, and therefore apply in more general settings.
