Open-Set Heterogeneous Domain Adaptation: Theoretical Analysis and Algorithm

Thai-Hoang Pham, Yuanlong Wang, Changchang Yin, Xueru Zhang, Ping Zhang

TL;DR

This work defines Open-Set Heterogeneous Domain Adaptation (OSHeDA), a setting in which source and target domains differ in feature space and the target domain also contains classes unseen in the source. It develops a theoretical framework with learning bounds that connect target risk to source risk, open-set differences, and JS-divergence-based domain distances. It derives both infinite-data and finite-data results and contrasts them with homogeneous DA (HoDA) bounds under covariate shift. Guided by these insights, the paper introduces RL-OSHeDA, a two-stage representation-learning method that maps heterogeneous inputs to a shared space, aligns known-class representations via centroid-based measures, and uses a non-negative open-set risk with pseudo-labeling to identify novel classes. Empirical results across seven diverse datasets (including vision, text, and clinical ECG data) show RL-OSHeDA outperforming state-of-the-art baselines on the OSHeDA task, with ablations highlighting the importance of each component and of the pseudo-labeling strategy. The work advances practical cross-domain transfer in settings with mismatched feature and label spaces and emerging unseen classes, enabling more robust deployment in real-world heterogeneous environments.
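The JS-divergence term mentioned above can be made concrete with a small example. The following is a minimal sketch for discrete distributions only, not the estimator used in the paper; the function names `kl_divergence` and `js_divergence` are illustrative assumptions.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats for discrete distributions given as equal-length lists."""
    # Terms with p_i == 0 contribute nothing by convention.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: mean KL divergence of p and q to their mixture."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical label-conditional histograms for a source and a target domain.
source_hist = [0.7, 0.2, 0.1]
target_hist = [0.1, 0.2, 0.7]
print(js_divergence(source_hist, target_hist))
```

Unlike KL divergence, this quantity is symmetric and bounded above by $\log 2$ (in nats), so it stays finite even when the two distributions have disjoint support.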

Abstract

Domain adaptation (DA) tackles the issue of distribution shift by learning a model from a source domain that generalizes to a target domain. However, most existing DA methods are designed for scenarios where the source and target domain data lie within the same feature space, which limits their applicability in real-world situations. Recently, heterogeneous DA (HeDA) methods have been introduced to address the challenges posed by heterogeneous feature spaces between source and target domains. Despite their successes, current HeDA techniques fall short when there is a mismatch in both feature and label spaces. To address this, this paper explores a new DA scenario called open-set HeDA (OSHeDA). In OSHeDA, the model must not only handle heterogeneity in feature space but also identify samples belonging to novel classes. To tackle this challenge, we first develop a novel theoretical framework that constructs learning bounds for the prediction error on the target domain. Guided by this framework, we propose a new DA method called Representation Learning for OSHeDA (RL-OSHeDA). This method is designed to simultaneously transfer knowledge between heterogeneous data sources and identify novel classes. Experiments across text, image, and clinical data demonstrate the effectiveness of our algorithm. The model implementation is available at https://github.com/pth1993/OSHeDA.
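To illustrate what transferring knowledge between heterogeneous feature spaces can look like, the toy sketch below projects source and target inputs of different dimensionality into one shared space and measures how far apart the class centroids of the known classes are. It is not the RL-OSHeDA loss: the linear encoders `W_s` and `W_t`, the helper names, and the data are all assumptions made for illustration.

```python
def project(x, weights):
    """Apply a linear map; weights holds one row per output dimension."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def class_centroids(features, labels):
    """Return the mean representation for each class label."""
    sums, counts = {}, {}
    for z, y in zip(features, labels):
        acc = sums.setdefault(y, [0.0] * len(z))
        for i, v in enumerate(z):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def centroid_gap(src, tgt):
    """Mean squared Euclidean distance between centroids of classes found in both domains."""
    shared = set(src) & set(tgt)
    if not shared:
        return 0.0
    total = sum(sum((a - b) ** 2 for a, b in zip(src[y], tgt[y])) for y in shared)
    return total / len(shared)

# Toy data: source inputs are 3-D, target inputs are 2-D (heterogeneous features).
W_s = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]  # hypothetical source encoder, 3-D -> 2-D
W_t = [[1.0, 0.0], [0.0, 2.0]]            # hypothetical target encoder, 2-D -> 2-D
xs, ys = [[1.0, 0.0, 0.0], [3.0, 0.0, 0.0], [0.0, 1.0, 1.0]], [0, 0, 1]
xt, yt = [[2.0, 0.0], [0.0, 1.0], [5.0, 5.0]], [0, 1, "unknown"]

zs = [project(x, W_s) for x in xs]
zt = [project(x, W_t) for x in xt]
gap = centroid_gap(class_centroids(zs, ys), class_centroids(zt, yt))
print(gap)
```

The target sample labeled `"unknown"` has no source counterpart, so it is excluded from the gap. Only classes shared by both domains are aligned, which is the reason an open-set method needs a separate mechanism for novel classes.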

Paper Structure

This paper contains 49 sections, 10 theorems, 43 equations, 5 figures, 21 tables, 1 algorithm.

Key Result

Theorem 1

Given a loss function $L$ satisfying Assumption 1, for any $h \in \mathcal{H}, f_s \in \mathcal{F}_s, f_t \in \mathcal{F}_t$, we have:

Figures (5)

  • Figure 1: A motivating example about OSHeDA in the context of screening diseases using electrocardiogram (ECG) data. While digital ECGs comprise the majority of labeled data for training ML models for disease screening, physical or paper ECGs remain prevalent worldwide. Thus, the transfer of knowledge from digital ECG datasets is essential to support the training of ML models that analyze paper ECGs. Moreover, ML systems must effectively manage rare abnormalities (indicated with gray boxes), which may not be available in training data, to prevent misdiagnosis.
  • Figure 2: Overall architecture of RL-OSHeDA, illustrated with a motivating example from an ECG-based diagnosis application. We use a two-stage learning process to update model parameters. In stage 1, model parameters are updated by optimizing $L_{cls}$. In stage 2, model parameters are updated by optimizing $L_{cls}$, $L_{inv}$, $L_{seg}$, and $L_{osd}$ with the help of the pseudo-label model $g$.
  • Figure 3: Critical Difference diagram for all methods, calculated from 56 DA tasks. RL-OSHeDA is the highest-ranked method on the $HOS$ metric, and its performance is significantly better than that of the baselines (as indicated by the lack of connections between RL-OSHeDA and the baselines in the diagram).
  • Figure 4: Performance w.r.t. the number of labeled target instances per class on the CIFAR10 & ILSVRC2012 dataset.
  • Figure 5: Visualization of the representation spaces learned by RL-OSHeDA and STN for the NUSWIDE & ImageNet dataset. Different colors represent different classes, with the unknown class denoted in grey.
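The $HOS$ metric in Figure 3 is not defined on this page. The sketch below assumes the definition commonly used in open-set DA: the harmonic mean of the average per-class accuracy on known classes (often written OS*) and the accuracy on the unknown class (UNK). The function names and toy labels are hypothetical.

```python
def per_class_accuracy(y_true, y_pred, label):
    """Fraction of samples with true class `label` that were predicted as `label`."""
    idx = [i for i, y in enumerate(y_true) if y == label]
    if not idx:
        return 0.0
    return sum(y_pred[i] == label for i in idx) / len(idx)

def hos_score(y_true, y_pred, known_classes, unknown_label):
    """Harmonic mean of known-class accuracy (OS*) and unknown-class accuracy (UNK)."""
    os_star = sum(per_class_accuracy(y_true, y_pred, c) for c in known_classes) / len(known_classes)
    unk = per_class_accuracy(y_true, y_pred, unknown_label)
    if os_star + unk == 0:
        return 0.0
    return 2 * os_star * unk / (os_star + unk)

# Toy predictions: two known classes (0 and 1) plus an unknown class.
y_true = [0, 0, 1, 1, "unk", "unk"]
y_pred = [0, 1, 1, 1, "unk", 0]
score = hos_score(y_true, y_pred, known_classes=[0, 1], unknown_label="unk")
print(score)
```

Under this definition, a model that never predicts the unknown class scores 0 regardless of its known-class accuracy, so the metric rewards balanced performance on known and novel classes.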

Theorems & Definitions (14)

  • Theorem 1
  • Remark 1
  • Proposition 1
  • Remark 2
  • Proposition 2: Adapted from biau2020some
  • Remark 3
  • Theorem 2
  • Proposition 3
  • Lemma 1
  • Lemma 2
  • ...and 4 more