Unsupervised Domain Adaptation for Binary Classification with an Unobservable Source Subpopulation

Chao Ying, Jun Jin, Haotian Zhang, Qinglong Tian, Yanyuan Ma, Yixuan Li, Jiwei Zhao

TL;DR

This work addresses unsupervised domain adaptation for binary classification under structured missingness, where the source dataset lacks the subpopulation defined by $(Y,A)=(1,1)$. Under the conditional invariance assumption $p({\bf X}|Y,A,R=1)=p({\bf X}|Y,A,R=0)$, the authors derive target-domain prediction models $\eta_1({\bf x})$, $\eta_0({\bf x})$, and $\eta({\bf x})$ that depend on observable quantities and target-subpopulation proportions. To estimate these proportions, they formulate a KL-divergence-based distribution-matching approach for the vector $\boldsymbol{\beta}=(\beta_{00},\beta_{10})$, establish the consistency of the resulting estimator, and give a generalization bound for the resulting classifier that scales with the estimation error $\|\widehat{\boldsymbol\beta}-\boldsymbol\beta\|_1$ and the hypothesis class complexity. Synthetic and real-data experiments (notably Waterbirds) show improved accuracy and F1 scores on the target over the naive benchmark, especially for the unobserved subpopulation. Extensions to multi-class and multi-environment settings are also discussed.

Abstract

We study an unsupervised domain adaptation problem where the source domain consists of subpopulations defined by the binary label $Y$ and a binary background (or environment) $A$. We focus on a challenging setting in which one such subpopulation in the source domain is unobservable. Naively ignoring this unobserved group can result in biased estimates and degraded predictive performance. Despite this structured missingness, we show that the prediction in the target domain can still be recovered. Specifically, we rigorously derive both background-specific and overall prediction models for the target domain. For practical implementation, we propose a distribution matching method to estimate the subpopulation proportions. We provide theoretical guarantees for the asymptotic behavior of our estimator and establish an upper bound on the prediction error. Experiments on both synthetic and real-world datasets show that our method outperforms the naive benchmark that does not account for this unobservable source subpopulation.
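To make the idea of estimating proportions by distribution matching concrete, here is a minimal toy sketch. It is not the authors' estimator. It assumes two one-dimensional Gaussian component densities that are fully known, and it recovers a single mixing weight from unlabeled samples by maximizing the mean log-likelihood of the mixture, which is equivalent to minimizing the KL divergence from the empirical sample distribution to the mixture, up to a constant. The names `gaussian_pdf` and `match_mixture_weight`, the grid search, and all numeric settings are hypothetical choices made for illustration.

```python
import numpy as np


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution evaluated at x."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))


def match_mixture_weight(x_target, pdf_a, pdf_b, grid_size=199):
    """Grid-search the weight w that best matches w * pdf_a + (1 - w) * pdf_b
    to the unlabeled samples, by maximizing the mean log-likelihood."""
    fa = pdf_a(x_target)  # component A density at each sample
    fb = pdf_b(x_target)  # component B density at each sample
    grid = np.linspace(0.005, 0.995, grid_size)
    # Mixture density for every candidate weight: shape (grid_size, n_samples).
    mixture = grid[:, None] * fa[None, :] + (1.0 - grid[:, None]) * fb[None, :]
    mean_loglik = np.log(mixture).mean(axis=1)
    return float(grid[np.argmax(mean_loglik)])


# Toy data (assumption): unlabeled samples from a mixture with true weight 0.3.
rng = np.random.default_rng(0)
true_w = 0.3
n = 4000
from_a = rng.random(n) < true_w
x = np.where(from_a, rng.normal(-2.0, 1.0, n), rng.normal(2.0, 1.0, n))

w_hat = match_mixture_weight(
    x,
    lambda t: gaussian_pdf(t, -2.0, 1.0),
    lambda t: gaussian_pdf(t, 2.0, 1.0),
)
print(f"true weight = {true_w}, estimated weight = {w_hat:.3f}")
```

In the paper's setting the component distributions would have to be learned from the observed source subpopulations rather than assumed known, and the quantity being estimated is the vector $\boldsymbol{\beta}$ rather than a single weight, so this sketch only illustrates the matching principle.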

Paper Structure

This paper contains 19 sections, 8 theorems, 96 equations, 6 figures, 3 tables, 1 algorithm.

Key Result

Proposition 1

Define the model $\tau_0({\bf x}) = \hbox{pr}(A=1 \mid {\bf X}={\bf x}, R=0)$ and the model both of which can be implemented using the observed data in our UDA setting. Then the three prediction models in the target domain are given by:

Figures (6)

  • Figure 1: The left panel displays the performance of the F$_1$ score and accuracy for $\eta_1({\bf x})$ and $\xi_1({\bf x})$ across different scenarios, while the right panel presents the corresponding results for $\eta({\bf x})$ and $\xi({\bf x})$.
  • Figure 2: The left panel displays the performance of the F$_1$ score and accuracy for $\eta_1({\bf x})$ and $\xi_1({\bf x})$ across different scenarios, while the right panel presents the corresponding results for $\eta({\bf x})$ and $\xi({\bf x})$.
  • Figure 3: Performance comparison of our proposed estimators $\eta_1({\bf x})$, $\eta({\bf x})$, and the benchmark method $\xi_1({\bf x})$, $\xi({\bf x})$ under the setting $a = 0.5$ with either $c=0.5$ and varying $b$ or $b=0.5$ and varying $c$.
  • Figure 4: Performance comparison of our proposed estimator $\eta_0({\bf x})$, and the benchmark method $\xi_0({\bf x})$ under the setting $a = 0.5$ with either $c=0.5$ and varying $b$ or $b=0.5$ and varying $c$.
  • Figure 5: Performance comparison of our proposed estimators $\eta_1({\bf x})$, $\eta({\bf x})$, and the benchmark method $\xi_1({\bf x})$, $\xi({\bf x})$ under the setting $a = 0.7$ with either $c=0.5$ and varying $b$ or $b=0.5$ and varying $c$.
  • ...and 1 more figure

Theorems & Definitions (17)

  • Proposition 1
  • Lemma 1
  • Lemma 2
  • Remark 1
  • Theorem 1
  • Proposition 2
  • Remark 2
  • Proof of Proposition \ref{pro:relation}
  • Proof of Lemma \ref{lem:iden}
  • Proof of Lemma \ref{lem:max}
  • ...and 7 more