Unsupervised Domain Adaptation for Binary Classification with an Unobservable Source Subpopulation

Chao Ying; Jun Jin; Haotian Zhang; Qinglong Tian; Yanyuan Ma; Yixuan Li; Jiwei Zhao

TL;DR

This work addresses unsupervised domain adaptation for binary classification under structured missingness, where the source dataset lacks the subpopulation defined by $(Y,A)=(1,1)$. Under the conditional invariance assumption $p({\bf X}|Y,A,R=1)=p({\bf X}|Y,A,R=0)$, the authors derive the target-domain prediction models $\eta_1({\bf x})$, $\eta_0({\bf x})$, and $\eta({\bf x})$, which depend on observable quantities and the target subpopulation proportions. To estimate these proportions, they formulate a KL-divergence-based distribution-matching approach for the vector $\boldsymbol{\beta}=(\beta_{00},\beta_{10})$. They establish statistical consistency of the estimator and provide a generalization bound for the resulting classifier that scales with the estimation error $\|\widehat{\boldsymbol\beta}-\boldsymbol\beta\|_1$ and the complexity of the hypothesis class. Synthetic and real-data experiments (notably Waterbirds) complement the theory, showing higher predictive accuracy and F1 scores on the target domain than naive benchmarks, especially for the subpopulation that is unobserved in the source. Extensions to multi-class and multi-environment settings are also discussed.

Abstract

We study an unsupervised domain adaptation problem where the source domain consists of subpopulations defined by the binary label $Y$ and a binary background (or environment) $A$. We focus on a challenging setting in which one such subpopulation in the source domain is unobservable. Naively ignoring this unobserved group can result in biased estimates and degraded predictive performance. Despite this structured missingness, we show that the prediction in the target domain can still be recovered. Specifically, we rigorously derive both background-specific and overall prediction models for the target domain. For practical implementation, we propose a distribution matching method to estimate the subpopulation proportions. We provide theoretical guarantees for the asymptotic behavior of our estimator and establish an upper bound on the prediction error. Experiments on both synthetic and real-world datasets show that our method outperforms the naive benchmark that does not account for the unobservable source subpopulation.
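
To give a rough feel for what estimating subpopulation proportions by distribution matching can look like, the following is a minimal toy sketch and not the authors' estimator. It assumes a one-dimensional feature, three Gaussian components whose densities are fully known, and no unobservable component, none of which comes from the paper. The function names (`gaussian_pdf`, `estimate_mixture_weights`, `simulate_and_estimate`) are hypothetical. The weights are fitted by maximizing the average log-likelihood of the target sample, which is equivalent to minimizing the KL divergence from the empirical target distribution to the mixture, up to a constant.

```python
import numpy as np


def gaussian_pdf(x, mean, std):
    """Density of N(mean, std^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))


def estimate_mixture_weights(component_densities, n_iter=500, tol=1e-10):
    """Fit mixture weights for fixed, known component densities.

    component_densities: array of shape (n, K); entry (i, k) is the density
    of component k at target point i. The fixed-point (EM) update below
    increases the average log-likelihood at every step, i.e. it decreases
    KL(empirical target || mixture) up to an additive constant.
    """
    n, k = component_densities.shape
    weights = np.full(k, 1.0 / k)  # start from uniform proportions
    for _ in range(n_iter):
        weighted = component_densities * weights
        # Posterior probability that each point came from each component.
        responsibilities = weighted / weighted.sum(axis=1, keepdims=True)
        new_weights = responsibilities.mean(axis=0)
        converged = np.abs(new_weights - weights).sum() < tol
        weights = new_weights
        if converged:
            break
    return weights


def simulate_and_estimate(seed=0, n=20000):
    """Toy experiment: all numbers here are assumptions for illustration."""
    rng = np.random.default_rng(seed)
    true_weights = np.array([0.5, 0.3, 0.2])
    means = np.array([-3.0, 0.0, 3.0])
    labels = rng.choice(3, size=n, p=true_weights)
    x = rng.normal(means[labels], 1.0)  # unlabeled "target" sample
    densities = np.column_stack([gaussian_pdf(x, m, 1.0) for m in means])
    return true_weights, estimate_mixture_weights(densities)


if __name__ == "__main__":
    true_w, est_w = simulate_and_estimate()
    print("true proportions:     ", true_w)
    print("estimated proportions:", est_w)
    print("L1 error:", np.abs(est_w - true_w).sum())
```

The printed L1 error plays the same role in this toy as $\|\widehat{\boldsymbol\beta}-\boldsymbol\beta\|_1$ does in the bound described in the TL;DR. In the setting of the paper the component densities are not known in closed form and one of them is unobservable in the source, so this sketch covers only the proportion-fitting idea.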
