When Invariant Representation Learning Meets Label Shift: Insufficiency and Theoretical Insights

You-Wei Luo, Chuan-Xian Ren

TL;DR

The main results show the insufficiency of invariant representation learning alone and prove the sufficiency and necessity of GLS correction for generalization, providing theoretical support and new directions for building generalizable models under dataset shift.

Abstract

As a crucial step toward real-world learning scenarios with changing environments, dataset shift theory and invariant representation learning algorithms have been extensively studied to relax the identical-distribution assumption of classical learning. Among the different assumptions on the nature of shifting distributions, generalized label shift (GLS) is the most recently developed and shows great potential for handling the complex factors within the shift. In this paper, we explore the limitations of current dataset shift theory and algorithms, and provide new insights through a comprehensive understanding of GLS. On the theoretical side, two informative generalization bounds are derived, and the GLS learner is proved to be sufficiently close to the optimal target model from a Bayesian perspective. The main results show the insufficiency of invariant representation learning and prove the sufficiency and necessity of GLS correction for generalization, which provides theoretical support and new directions for building generalizable models under dataset shift. On the methodological side, we provide a unified view of existing shift correction frameworks and propose a kernel embedding-based correction algorithm (KECA) that minimizes the generalization error and achieves successful knowledge transfer. Both theoretical results and extensive experimental evaluations demonstrate the sufficiency and necessity of GLS correction for addressing dataset shift and the superiority of the proposed algorithm.
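The abstract's kernel embedding-based correction measures distribution discrepancy through kernel mean embeddings. As a minimal illustrative sketch (not the paper's actual KECA algorithm), the squared maximum mean discrepancy (MMD) between two samples can be estimated from kernel matrices; fully invariant representations would drive this quantity to zero. All function names here are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    # Gaussian RBF kernel matrix k(x, y) = exp(-gamma * ||x - y||^2).
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def mmd2(X, Y, gamma=1.0):
    # Squared MMD between the kernel mean embeddings of two samples:
    # ||mu_P - mu_Q||^2 = E[k(x,x')] - 2 E[k(x,y)] + E[k(y,y')].
    return (rbf_kernel(X, X, gamma).mean()
            - 2 * rbf_kernel(X, Y, gamma).mean()
            + rbf_kernel(Y, Y, gamma).mean())

rng = np.random.default_rng(0)
same = mmd2(rng.normal(0, 1, (200, 2)), rng.normal(0, 1, (200, 2)))
shifted = mmd2(rng.normal(0, 1, (200, 2)), rng.normal(2, 1, (200, 2)))
# Shifted distributions have embeddings that are farther apart.
assert same < shifted
```

Matching conditional distributions class by class (as in GLS correction) amounts to applying such a discrepancy per label rather than on the marginals.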

Paper Structure

This paper contains 14 sections, 9 theorems, 21 equations, 4 figures, 4 tables, and 1 algorithm.

Key Result

Theorem 1

Assume that $d_{\mathrm{JS}}(P_Y,Q_Y)\geq d_{\mathrm{JS}}(P_Z,Q_Z)$. Then for any transformation $g: X \mapsto Z$ and hypothesis $h: Z \mapsto Y$, we have
$$\varepsilon_P(h\circ g)+\varepsilon_Q(h\circ g)\;\geq\;\frac{1}{2}\left(d_{\mathrm{JS}}(P_Y,Q_Y)-d_{\mathrm{JS}}(P_Z,Q_Z)\right)^2.$$
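This lower bound of the joint error (following zhao2019learning, where $d_{\mathrm{JS}}$ denotes the JS distance, i.e., the square root of the JS divergence) can be evaluated numerically. A small sketch under these assumptions, using the label priors from Figure 1 and supposing a perfectly invariant representation ($d_{\mathrm{JS}}(P_Z,Q_Z)=0$):

```python
import numpy as np

def js_divergence(p, q):
    # Jensen-Shannon divergence (natural log); the JS distance is its sqrt.
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log(a / b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p_y, q_y = [0.6, 0.4], [0.4, 0.6]     # label priors from Figure 1
d_js_y = np.sqrt(js_divergence(p_y, q_y))
d_js_z = 0.0                           # perfectly invariant representation
lower_bound = 0.5 * (d_js_y - d_js_z) ** 2
print(f"eps_P + eps_Q >= {lower_bound:.4f}")
```

The takeaway matches the theorem: the more invariant the representations are made while the label priors differ, the larger this unavoidable joint error becomes.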

Figures (4)

  • Figure 1: Illustration of the invariant representation learning and dataset shift correction in $\mathbb{R}^1$. (a) The dataset shift exists as $P_{X|Y}\neq Q_{X|Y}$ and $P_Y\neq Q_Y$, where $p_Y=[0.6;0.4]$ and $q_Y=[0.4;0.6]$. (b) Marginal invariant transformation $g_{\mathrm{mar}}$ may misalign the conditional distributions. (c) Conditional invariant transformation $g_{\mathrm{con}}$ is still insufficient to align the joint distributions since label shift leads to different proportions of aligned conditional distributions. (d) GLS correction is sufficient to address dataset shift with the $w^*$ reweighting proportions of conditional distributions.
  • Figure 2: (a)-(b): Accuracy curves/bars and 95% confidence intervals of different dataset shift correction models. (c)-(d): Accuracy curves and label discrepancies $d_{\mathrm{TV}}(P_Y,Q_Y)$ with different subsampling rates and domains. (e)-(h): Visualization of representations and decision boundaries. '$\blacksquare$': class-wise means on source domain, '$\bullet$': target samples, 'background color': decision boundary.
  • Figure 3: (a)-(d): Visualization of representations of the KECA model and the Oracle model (i.e., KECA with ground-truth weight $w^*$) via the t-SNE dimensionality reduction algorithm (van2008visualizing) on Office-31. '$\circ$': source samples; '$+$': target samples. (e)-(h): Curves of label discrepancy $d_{\mathrm{JS},a}(P^w_Y,Q_Y)$ and conditional discrepancy $D(P^w_{Z|Y},Q_{Z|Y})$ on the Office-31 dataset. (e)-(f): curves of the complete training process; (g)-(h): curves after the warm-up stage where the conditional matching and importance weight are applied.
  • Figure 4: Curves of source accuracy, target accuracy, loss objective $\mathcal{L}_{\mathrm{GLS}}$ and error of prior estimation $d_{\mathrm{TV}}(P^w_Y,Q_Y)$ on Office-31 dataset. (a) and (d): curves of complete training process; (b) and (e): curves after the warm-up stage where the conditional matching and importance weight are applied; (c) and (f): comparison between model w/ warm-up and model w/o warm-up.
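Figure 1(d) shows GLS correction reweighting the source class-conditional proportions with $w^*$ so that they match the target prior. A minimal sketch of this correction, using the class priors stated in the Figure 1 caption (only the ground-truth weights are shown here; the paper estimates them from data):

```python
import numpy as np

# Class priors from Figure 1: p_Y = [0.6, 0.4] (source), q_Y = [0.4, 0.6] (target).
p_y = np.array([0.6, 0.4])
q_y = np.array([0.4, 0.6])

# Ground-truth importance weights w*_k = q_Y(k) / p_Y(k). Reweighting the
# source class-conditional mixture by w* reproduces the target prior exactly.
w_star = q_y / p_y
assert np.allclose(w_star * p_y, q_y)
print(w_star)  # the minority target class is down-weighted, the majority up-weighted
```

Under GLS, aligning the $w^*$-reweighted source conditionals with the target conditionals then aligns the full joint distributions, which is the correction Figure 1(d) illustrates.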

Theorems & Definitions (17)

  • Theorem 1: Lower bound of joint error (zhao2019learning)
  • Definition 1 (combes2020domain)
  • Definition 2: Invariant transformations
  • Remark 1
  • Definition 3: Linear independence of functions
  • Proposition 1
  • Proposition 2: Impossibility of dataset shift correction
  • Proposition 3
  • Definition 4
  • Theorem 2: Sufficiency of GLS correction
  • ...and 7 more