Out-Of-Domain Unlabeled Data Improves Generalization

Amir Hossein Saberi; Amir Najafi; Alireza Heidari; Mohammad Hosein Movasaghinia; Abolfazl Motahari; Babak H. Khalaj

TL;DR

The paper tackles semi-supervised classification under distributional shifts by introducing Robust Self-Supervised (RSS) training, which fuses Distributionally Robust Optimization with self-training. RSS uses labeled data to optimize a robust loss while leveraging unlabeled data with pseudo labels to regularize the model, and it remains solvable in polynomial time under common convexity assumptions. The authors provide non-asymptotic generalization bounds for both robust and non-robust losses in a two-component Gaussian mixture, showing that unlabeled data can substantially narrow the generalization gap when $n \ge \Omega\left(m^2/d\right)$ and the shift $\alpha$ is controlled. Empirical results on simulated data and histopathology datasets corroborate the theory, demonstrating gains from out-of-domain unlabeled data, with larger benefits when unlabeled samples are not too far from the in-domain distribution. Overall, RSS offers a principled framework that combines self-training, DRO, and optimal transport to improve generalization in semi-supervised learning under distributional shifts.

Abstract

We propose a novel framework for incorporating unlabeled data into semi-supervised classification problems, covering the minimization of either i) adversarially robust or ii) non-robust loss functions. Notably, we allow the unlabeled samples to deviate slightly (in the total variation sense) from the in-domain distribution. The core idea behind our framework is to combine Distributionally Robust Optimization (DRO) with self-supervised training. As a result, we can also leverage efficient polynomial-time algorithms for the training stage. From a theoretical standpoint, we apply our framework to the classification problem of a mixture of two Gaussians in $\mathbb{R}^d$, where in addition to the $m$ independent and labeled samples from the true distribution, a set of $n$ (usually with $n\gg m$) out-of-domain and unlabeled samples is given as well. Using only the labeled data, it is known that the generalization error can be bounded by a term proportional to $\left(d/m\right)^{1/2}$. However, using our method on both isotropic and non-isotropic Gaussian mixture models, one can derive a new set of analytically explicit and non-asymptotic bounds which show substantial improvement in the generalization error compared to ERM. Our results underscore two significant insights: 1) out-of-domain samples, even when unlabeled, can be harnessed to narrow the generalization gap, provided that the true data distribution adheres to a form of the "cluster assumption", and 2) the semi-supervised learning paradigm can be regarded as a special case of our framework when there are no distributional shifts. We validate our claims through experiments conducted on a variety of synthetic and real-world datasets.
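The setting in the abstract can be illustrated with a small simulation. The sketch below is not the authors' RSS algorithm: it uses plain self-training with no DRO term, models the distribution shift as a hypothetical mean offset `alpha` on one coordinate, and picks all names and parameter values (`run`, `signed_mean`, `d=20`, `m=10`, `n=2000`) purely for illustration. It only shows how pseudo-labeled, slightly shifted samples can be folded into a mean-based linear classifier for a two-component Gaussian mixture.

```python
import math
import random


def sample(rng, mean, count):
    """Draw `count` points from the mixture 0.5*N(mean, I) + 0.5*N(-mean, I)."""
    points, labels = [], []
    for _ in range(count):
        y = rng.choice((-1, 1))
        points.append([y * mu_j + rng.gauss(0.0, 1.0) for mu_j in mean])
        labels.append(y)
    return points, labels


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def signed_mean(points, signs):
    """Average of sign_i * x_i, a plug-in estimate of the component mean."""
    d = len(points[0])
    return [sum(s * x[j] for x, s in zip(points, signs)) / len(points)
            for j in range(d)]


def accuracy(theta, points, labels):
    """Fraction of points whose label matches sign(<theta, x>)."""
    hits = sum(1 for x, y in zip(points, labels) if y * dot(theta, x) > 0)
    return hits / len(points)


def run(seed=0, d=20, m=10, n=2000, alpha=0.3, n_test=2000):
    rng = random.Random(seed)
    mu = [2.0 / math.sqrt(d)] * d            # in-domain component mean, norm 2
    mu_shifted = [mu[0] + alpha] + mu[1:]    # assumed out-of-domain mean shift

    x_lab, y_lab = sample(rng, mu, m)        # m labeled in-domain samples
    x_unl, _ = sample(rng, mu_shifted, n)    # n unlabeled shifted samples
    x_test, y_test = sample(rng, mu, n_test) # in-domain test set

    # Step 1: estimator from labeled data only.
    theta_sup = signed_mean(x_lab, y_lab)
    # Step 2: pseudo-label the unlabeled samples with that estimator.
    pseudo = [1 if dot(theta_sup, x) > 0 else -1 for x in x_unl]
    # Step 3: re-estimate using labeled plus pseudo-labeled samples.
    theta_st = signed_mean(x_lab + x_unl, y_lab + pseudo)

    return accuracy(theta_sup, x_test, y_test), accuracy(theta_st, x_test, y_test)


if __name__ == "__main__":
    acc_sup, acc_st = run()
    print(f"labeled only: {acc_sup:.3f}  with pseudo-labels: {acc_st:.3f}")
```

Varying `alpha` and `n` in `run` gives a rough feel for how the benefit of the unlabeled samples depends on the size of the shift and the amount of unlabeled data.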

Paper Structure (22 sections, 5 theorems, 134 equations, 3 tables, 2 algorithms)

Introduction
Prior Works
Main Contributions
Notation and Definitions
Problem Definition
Proposed Method: Robust Self-Supervised (RSS) Training
Model Optimization: Algorithm and Theoretical Guarantees
Theoretical Guarantees and Generalization Bounds
Experimental Results
Experiment of Simulated Data
Experiment of Histopathology Data
Conclusion
Auxiliary Definitions
Proof of Theorems
Auxiliary Lemmas
...and 7 more sections

Key Result

Lemma 1.2

For a sufficiently small $\epsilon>0$, the minimax optimization problem of the DRL formulation (eq:originalMinimax) admits a dual form in which $\gamma$ and $\epsilon$ are dual parameters, and there is a bijective and reciprocal relation between $\epsilon$ and $\gamma^*$, i.e., the optimal value of $\gamma$ that minimizes the right-hand side.
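The dual form itself is not reproduced in this summary. For orientation, the strong-duality result commonly stated in the Wasserstein DRO literature has the shape below. The notation is an assumption rather than the paper's own: $P_0$ is the nominal distribution, $c$ the transport cost, $W_c$ the induced Wasserstein distance, and $\ell(\theta;\cdot)$ the loss.

$$
\sup_{P:\,W_c(P,P_0)\le\epsilon}\ \mathbb{E}_{P}\big[\ell(\theta;Z)\big]
\;=\;
\inf_{\gamma\ge 0}\ \Big\{\gamma\epsilon+\mathbb{E}_{Z\sim P_0}\Big[\sup_{z'}\big\{\ell(\theta;z')-\gamma\,c(z',Z)\big\}\Big]\Big\}
$$

In this form $\gamma$ plays the role of a Lagrange multiplier for the transport budget $\epsilon$, which is consistent with the lemma's statement that each $\epsilon$ corresponds to an optimal $\gamma^*$.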

Theorems & Definitions (20)

Definition 1.1: Distributionally Robust Learning ($\mathrm{DRL}$)
Lemma 1.2: From blanchet2019robust
Definition 3.1: Robust Self-Supervised (RSS) Training
Theorem 4.1
Theorem 4.2
Corollary 4.3
Theorem 4.4: Generalization Bound for General Gaussian Mixture Models
Definition A.1: Wasserstein Distance
Definition A.2: $\epsilon$-neighborhood of a Distribution
Proof: Proof of Theorem \ref{robustLossgeneralization}
...and 10 more
