Domain Adaptation with Cauchy-Schwarz Divergence

Wenzhe Yin, Shujian Yu, Yicong Lin, Jie Liu, Jan-Jakob Sonke, Efstratios Gavves

TL;DR

This work introduces the Cauchy-Schwarz (CS) divergence and its conditional variant (CCS) as principled divergences for unsupervised domain adaptation (UDA), enabling simultaneous alignment of the source marginal $p^s(\mathbf{z})$ and conditional $p^s(y|\mathbf{z})$ distributions with their target counterparts $p^t(\mathbf{z})$ and $p^t(y|\mathbf{z})$. It derives a CS-based generalization bound that can be tighter than the KL-based bound and provides nonparametric, kernel-based estimators for both the marginal and the conditional discrepancy. The authors implement two training paradigms, a distance-metric approach (CS/CCS) and an adversarial variant (CS-adv), and report superior performance on the digits, Office-Home, Office-31, and VisDA17 datasets, whereas KL-based methods are often unstable. They also show that CCS can be plugged into existing UDA frameworks (e.g., f-DAL, kSHOT) to further improve results, highlighting the practical value of joint distribution alignment under real-world domain shift.

Abstract

Domain adaptation aims to use training data from one or multiple source domains to learn a hypothesis that can be generalized to a different, but related, target domain. As such, having a reliable measure for evaluating the discrepancy of both marginal and conditional distributions is crucial. We introduce Cauchy-Schwarz (CS) divergence to the problem of unsupervised domain adaptation (UDA). The CS divergence offers a theoretically tighter generalization error bound than the popular Kullback-Leibler divergence. This holds for the general case of supervised learning, including multi-class classification and regression. Furthermore, we illustrate that the CS divergence enables a simple estimator of the discrepancy of both marginal and conditional distributions between source and target domains in the representation space, without requiring any distributional assumptions. We provide multiple examples to illustrate how the CS divergence can be conveniently used in both distance metric-based and adversarial training-based UDA frameworks, resulting in compelling performance.
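
To make the "simple estimator" for the marginal discrepancy concrete, here is a minimal sketch of a kernel plug-in estimate of the CS divergence between two sets of representations. It is an illustration under stated assumptions, not the authors' implementation: it assumes the squared-ratio convention $D_{\text{CS}}(p;q) = -\log\frac{(\int p\,q)^2}{\int p^2 \int q^2}$, a Gaussian kernel, and a hand-picked bandwidth `sigma`. The function names `gaussian_gram` and `cs_divergence` are hypothetical.

```python
import numpy as np


def gaussian_gram(a, b, sigma):
    """Gaussian kernel matrix between the rows of a (M, d) and b (N, d)."""
    sq_dist = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * sigma**2))


def cs_divergence(zs, zt, sigma=1.0):
    """Plug-in CS divergence estimate between source and target samples.

    Assumed form: log(mean K_ss) + log(mean K_tt) - 2 * log(mean K_st).
    The bandwidth sigma is a free parameter chosen here for illustration.
    """
    k_ss = gaussian_gram(zs, zs, sigma).mean()  # within-source similarity
    k_tt = gaussian_gram(zt, zt, sigma).mean()  # within-target similarity
    k_st = gaussian_gram(zs, zt, sigma).mean()  # cross-domain similarity
    return np.log(k_ss) + np.log(k_tt) - 2.0 * np.log(k_st)


# Toy usage: synthetic "source" and mean-shifted "target" representations.
rng = np.random.default_rng(0)
zs = rng.normal(size=(200, 2))
zt = rng.normal(loc=1.0, size=(200, 2))
print("CS estimate (shifted target):", cs_divergence(zs, zt))
print("CS estimate (same samples):  ", cs_divergence(zs, zs))
```

The estimate is zero when both sample sets coincide and grows as the cross-domain kernel similarity drops relative to the within-domain similarities. In a training loop the same expression would be written with differentiable tensor operations so it can serve as an alignment loss.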

Paper Structure

This paper contains 37 sections, 14 theorems, 92 equations, 13 figures, 7 tables, 1 algorithm.

Key Result

Proposition 1

For any $d$-variate Gaussian distributions $p\sim \mathcal{N}(\mu_1,\Sigma_1)$ and $q\sim \mathcal{N}(\mu_2,\Sigma_2)$, where $\Sigma_1$ and $\Sigma_2$ are positive definite, we have:

Figures (13)

  • Figure 1: A graphical illustration of the sets $\mathcal{A}_{\epsilon}$ and $\mathcal{A}_{\epsilon}^{\complement}$ defined in Proposition \ref{proposition_general_TV}.
  • Figure 2: The framework of the proposed conditional bi-classifier adversarial learning method with CS and CCS divergences. Feature extractor $f$ is used to obtain representations $\mathbf{z}^s$ and $\mathbf{z}^t$ for the source and target domains, respectively. Two classifiers $g_1$ and $g_2$ are used as a discriminator. CS divergence directly minimizes the discrepancy of $p(\mathbf{z})$ between two domains. CCS divergence measures the disagreement between two classifiers (adversarial loss).
  • Figure 3: Ablation study of the CS and CCS components on the MNIST-to-USPS task, compared with MMD and joint distribution MMD (JPMMD).
  • Figure 4: Integrating CCS with f-DAL on the M$\rightarrow$U (MNIST to USPS) task.
  • Figure 5: Values of $D_{\text{TV}}$ and $\sqrt{D_{\text{CS}}}$ for 1-dimensional Gaussian data when (a) the means $\mu$ differ and the standard deviation $\sigma>0$ is shared; and (b) the standard deviations $\sigma$ differ and the mean $\mu$ is shared.
  • ...and 8 more figures

Theorems & Definitions (26)

  • Proposition 1
  • proof
  • Proposition 2
  • Proposition 3
  • Proposition 4
  • Proposition 5: Empirical Estimator of $D_{\text{CS}}(p^s(\mathbf{z});p^t(\mathbf{z}))$ (Jenssen et al., 2006)
  • Remark 1
  • Proposition 6: Empirical Estimator of $D_{\text{CCS}}(p^s(\hat{y}|\mathbf{z});p^t(\hat{y}|\mathbf{z}))$ (Yu et al., 2023)
  • Remark 2
  • Proposition 1
  • ...and 16 more