Unsupervised Domain Adaptation for Binary Classification with an Unobservable Source Subpopulation
Chao Ying, Jun Jin, Haotian Zhang, Qinglong Tian, Yanyuan Ma, Yixuan Li, Jiwei Zhao
TL;DR
This work addresses unsupervised domain adaptation for binary classification under structured missingness, where the source dataset lacks the subpopulation defined by $(Y,A)=(1,1)$. Under the conditional invariance assumption $p({\bf X}|Y,A,R=1)=p({\bf X}|Y,A,R=0)$, the authors derive target-domain prediction models $\eta_1({\bf x})$, $\eta_0({\bf x})$, and $\eta({\bf x})$ that depend on observable quantities and the target subpopulation proportions. To estimate these proportions, they formulate a KL-divergence-based distribution-matching approach for the vector $\boldsymbol{\beta}=(\beta_{00},\beta_{10})$. The estimator is shown to be statistically consistent, and the resulting classifier admits a generalization bound that scales with the estimation error $\|\widehat{\boldsymbol\beta}-\boldsymbol\beta\|_1$ and the complexity of the hypothesis class. These theoretical guarantees are complemented by synthetic and real-data experiments (notably Waterbirds), which show improved predictive accuracy and F1 scores on the target over naive benchmarks, especially for the unobserved subpopulation. Overall, the framework offers a rigorous characterization and a practical solution for domain adaptation when a source subpopulation is entirely missing, and the authors discuss extensions to multi-class and multi-environment settings.
Abstract
We study an unsupervised domain adaptation problem where the source domain consists of subpopulations defined by the binary label $Y$ and a binary background (or environment) $A$. We focus on a challenging setting in which one such subpopulation in the source domain is unobservable. Naively ignoring this unobserved group can result in biased estimates and degraded predictive performance. Despite this structured missingness, we show that the prediction in the target domain can still be recovered. Specifically, we rigorously derive both background-specific and overall prediction models for the target domain. For practical implementation, we propose a distribution matching method to estimate the subpopulation proportions. We provide theoretical guarantees for the asymptotic behavior of our estimator and establish an upper bound on the prediction error. Experiments on both synthetic and real-world datasets show that our method outperforms a naive benchmark that does not account for the unobservable source subpopulation.
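To give a feel for what estimating subpopulation proportions by distribution matching can look like, the following is a minimal toy sketch, not the authors' estimator. It assumes a one-dimensional feature, a single unknown proportion, and two fully known Gaussian component densities, whereas the paper's setting involves a vector of proportions and one subpopulation that is never observed in the source. All names (`gaussian_pdf`, `estimate_mixture_weight`, `true_w`) are hypothetical. Minimizing the KL divergence from the target distribution to the mixture over the weight is equivalent to maximizing the average log mixture density on the target sample, and the fixed-point (EM-style) update below is one standard way to do that.

```python
import math
import random


def gaussian_pdf(x, mean, std=1.0):
    """Density of N(mean, std^2) at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def estimate_mixture_weight(samples, pdf_a, pdf_b, n_iter=200):
    """Estimate w in the mixture w * pdf_a + (1 - w) * pdf_b.

    The component densities are assumed known (a simplification for this toy).
    The update is the EM fixed point for the mixing weight, which maximizes the
    mean log mixture density, i.e. minimizes the empirical KL divergence from
    the sample distribution to the mixture.
    """
    dens_a = [pdf_a(x) for x in samples]
    dens_b = [pdf_b(x) for x in samples]
    w = 0.5  # neutral starting point
    for _ in range(n_iter):
        # Posterior probability that each sample came from component a.
        resp = [w * a / (w * a + (1.0 - w) * b) for a, b in zip(dens_a, dens_b)]
        w = sum(resp) / len(resp)
    return w


# Synthetic "target" sample: 30% from N(-1, 1), 70% from N(2, 1).
rng = random.Random(0)
true_w = 0.3
samples = [
    rng.gauss(-1.0, 1.0) if rng.random() < true_w else rng.gauss(2.0, 1.0)
    for _ in range(5000)
]

w_hat = estimate_mixture_weight(
    samples,
    lambda x: gaussian_pdf(x, -1.0),
    lambda x: gaussian_pdf(x, 2.0),
)
print(f"estimated proportion: {w_hat:.3f} (true value {true_w})")
```

In this toy the component densities are given; in practice they would have to be estimated or replaced by quantities that are identifiable from the observed data, which is the part the paper's derivation addresses.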
