Out-Of-Domain Unlabeled Data Improves Generalization

Amir Hossein Saberi, Amir Najafi, Alireza Heidari, Mohammad Hosein Movasaghinia, Abolfazl Motahari, Babak H. Khalaj

TL;DR

The paper tackles semi-supervised classification under distributional shifts by introducing Robust Self-Supervised (RSS) training, which fuses Distributionally Robust Optimization with self-training. RSS uses labeled data to optimize a robust loss while leveraging unlabeled data with pseudo labels to regularize the model, and it remains solvable in polynomial time under common convexity assumptions. The authors provide non-asymptotic generalization bounds for both robust and non-robust losses in a two-component Gaussian mixture, showing that unlabeled data can substantially narrow the generalization gap when $n \ge \Omega\left(m^2/d\right)$ and the shift $\alpha$ is controlled. Empirical results on simulated data and histopathology datasets corroborate the theory, demonstrating gains from out-of-domain unlabeled data, with larger benefits when unlabeled samples are not too far from the in-domain distribution. Overall, RSS offers a principled framework that combines self-training, DRO, and optimal transport to improve generalization in semi-supervised learning under distributional shifts.

Abstract

We propose a novel framework for incorporating unlabeled data into semi-supervised classification problems, where scenarios involving the minimization of either i) adversarially robust or ii) non-robust loss functions are considered. Notably, we allow the unlabeled samples to deviate slightly (in the total variation sense) from the in-domain distribution. The core idea behind our framework is to combine Distributionally Robust Optimization (DRO) with self-supervised training. As a result, we also leverage efficient polynomial-time algorithms for the training stage. From a theoretical standpoint, we apply our framework to the classification problem of a mixture of two Gaussians in $\mathbb{R}^d$, where in addition to the $m$ independent and labeled samples from the true distribution, a set of $n$ (usually with $n\gg m$) out-of-domain and unlabeled samples is given as well. Using only the labeled data, it is known that the generalization error can be bounded by $\propto\left(d/m\right)^{1/2}$. However, using our method on both isotropic and non-isotropic Gaussian mixture models, one can derive a new set of analytically explicit and non-asymptotic bounds which show substantial improvement on the generalization error compared to ERM. Our results underscore two significant insights: 1) out-of-domain samples, even when unlabeled, can be harnessed to narrow the generalization gap, provided that the true data distribution adheres to a form of the "cluster assumption", and 2) the semi-supervised learning paradigm can be regarded as a special case of our framework when there are no distributional shifts. We validate our claims through experiments conducted on a variety of synthetic and real-world datasets.
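
The setting described in the abstract (a few labeled in-domain samples plus many slightly shifted unlabeled samples from a two-component Gaussian mixture) can be illustrated with a small simulation. The sketch below is not the paper's RSS procedure: it uses plain hard pseudo-label self-training with no DRO step, and every name and value in it (`run_demo`, `shift`, the dimension, the sample sizes, the mean norm) is an assumption chosen only for illustration.

```python
import numpy as np


def sample_mixture(rng, n, mu, sigma=1.0):
    """Draw n points with y uniform on {-1, +1} and x ~ N(y * mu, sigma^2 I)."""
    y = rng.choice(np.array([-1, 1]), size=n)
    x = y[:, None] * mu + sigma * rng.standard_normal((n, mu.size))
    return x, y


def supervised_estimate(x, y):
    """Estimate the class mean direction from labeled data alone."""
    return (y[:, None] * x).mean(axis=0)


def self_training_estimate(theta0, x_unlabeled, n_rounds=5):
    """Refine theta0 by repeatedly pseudo-labeling the unlabeled points.

    Plain self-training (an illustrative stand-in, not RSS): label each
    point with the current linear classifier, then re-average.
    """
    theta = theta0
    for _ in range(n_rounds):
        pseudo = np.where(x_unlabeled @ theta >= 0, 1, -1)
        theta = (pseudo[:, None] * x_unlabeled).mean(axis=0)
    return theta


def error_rate(theta, x, y):
    """Fraction of points misclassified by the linear rule sign(<theta, x>)."""
    pred = np.where(x @ theta >= 0, 1, -1)
    return float(np.mean(pred != y))


def run_demo(seed=0, d=50, m=10, n=5000, shift=0.3):
    """Return (labeled-only error, self-training error) on in-domain test data.

    All defaults are hypothetical values picked for the illustration.
    """
    rng = np.random.default_rng(seed)
    mu = np.full(d, 2.0 / np.sqrt(d))                # in-domain mean, norm 2
    v = rng.standard_normal(d)
    mu_shifted = mu + shift * v / np.linalg.norm(v)  # out-of-domain mean
    x_lab, y_lab = sample_mixture(rng, m, mu)        # m labeled, in-domain
    x_unl, _ = sample_mixture(rng, n, mu_shifted)    # n unlabeled, shifted
    x_test, y_test = sample_mixture(rng, 20000, mu)  # in-domain test set
    theta_sup = supervised_estimate(x_lab, y_lab)
    theta_self = self_training_estimate(theta_sup, x_unl)
    return (error_rate(theta_sup, x_test, y_test),
            error_rate(theta_self, x_test, y_test))


if __name__ == "__main__":
    err_sup, err_self = run_demo()
    print(f"labeled only: {err_sup:.4f}   with unlabeled data: {err_self:.4f}")
```

Increasing `shift` in this toy setup moves the unlabeled mean further from the in-domain one, which is one way to probe the qualitative claim that the benefit shrinks as the unlabeled samples drift away from the in-domain distribution.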

Paper Structure

This paper contains 22 sections, 5 theorems, 134 equations, 3 tables, 2 algorithms.

Key Result

Lemma 1.2

For a sufficiently small $\epsilon>0$, the minimax optimization problem of equation (eq:originalMinimax) admits a dual form in which $\gamma$ and $\epsilon$ are dual parameters. There is a bijective and reciprocal relation between $\epsilon$ and $\gamma^*$, i.e., the optimal value of $\gamma$ that minimizes the right-hand side of the dual.
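
For orientation, the strong-duality result commonly stated in the Wasserstein DRO literature has the shape below. This is a sketch of the generic result, not a transcription of the paper's lemma: the symbols $\hat{P}_m$ (empirical distribution of the $m$ labeled samples), $c$ (transport cost), $\ell$ (loss), and $\phi_\gamma$ are assumed here for illustration, and the paper's statement may differ in notation and may include the outer minimization over the model parameters $\theta$.

```latex
\sup_{P:\, W_c(P,\hat{P}_m)\le\epsilon} \mathbb{E}_{P}\left[\ell(Z;\theta)\right]
= \inf_{\gamma\ge 0}\left\{ \gamma\epsilon
  + \frac{1}{m}\sum_{i=1}^{m}\phi_\gamma(Z_i;\theta) \right\},
\qquad
\phi_\gamma(z;\theta) = \sup_{z'}\left\{ \ell(z';\theta) - \gamma\, c(z',z) \right\}.
```

In this form, a larger radius $\epsilon$ corresponds to a smaller optimal $\gamma^*$, which is consistent with the reciprocal relation described in the lemma.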

Theorems & Definitions (20)

  • Definition 1.1: Distributionally Robust Learning ($\mathrm{DRL}$)
  • Lemma 1.2: From blanchet2019robust
  • Definition 3.1: Robust Self-Supervised (RSS) Training
  • Theorem 4.1
  • Theorem 4.2
  • Corollary 4.3
  • Theorem 4.4: Generalization Bound for General Gaussian Mixture Models
  • Definition A.1: Wasserstein Distance
  • Definition A.2: $\epsilon$-neighborhood of a Distribution
  • Proof: Proof of Theorem \ref{robustLossgeneralization}
  • ...and 10 more