Open-Set Heterogeneous Domain Adaptation: Theoretical Analysis and Algorithm
Thai-Hoang Pham, Yuanlong Wang, Changchang Yin, Xueru Zhang, Ping Zhang
TL;DR
This work defines Open-Set Heterogeneous Domain Adaptation (OSHeDA) to address simultaneous feature-space heterogeneity and unseen classes between source and target domains. It develops a theoretical framework with learning bounds that connect target risk to source risk, open-set differences, and JS-divergence-based domain distances, and derives both infinite-data and finite-data results, contrasting with HoDA bounds under covariate shift. Guided by these insights, the paper introduces RL-OSHeDA, a two-stage representation-learning method that maps heterogeneous inputs to a shared space, aligns known-class representations via centroid-based measures, and uses a non-negative open-set risk with pseudo-labeling to identify novel classes. Empirical results across seven diverse datasets (including vision, text, and clinical ECG data) show RL-OSHeDA outperforming state-of-the-art baselines on the OSHeDA task, with ablations highlighting the importance of each component and the pseudo-labeling strategy. The work advances practical cross-domain transfer in settings with both feature and label space mismatch and emerging unseen classes, enabling more robust deployment in real-world heterogeneous environments.
Abstract
Domain adaptation (DA) tackles distribution shift by learning a model from a source domain that generalizes to a target domain. However, most existing DA methods assume that source and target data lie in the same feature space, which limits their applicability in real-world situations. Recently, heterogeneous DA (HeDA) methods have been introduced to address heterogeneous feature spaces between source and target domains. Despite their successes, current HeDA techniques fall short when both the feature and label spaces are mismatched. To address this, this paper explores a new DA scenario called open-set HeDA (OSHeDA). In OSHeDA, the model must not only handle heterogeneity in the feature space but also identify samples belonging to novel classes. To tackle this challenge, we first develop a novel theoretical framework that constructs learning bounds for the prediction error on the target domain. Guided by this framework, we propose a new DA method called Representation Learning for OSHeDA (RL-OSHeDA). This method is designed to simultaneously transfer knowledge between heterogeneous data sources and identify novel classes. Experiments across text, image, and clinical data demonstrate the effectiveness of our algorithm. The model implementation is available at https://github.com/pth1993/OSHeDA.
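
To make the summary above more concrete, the following is a minimal sketch of two ideas it names: mapping source and target inputs of different dimensionality into a shared representation space, and comparing known-class centroids across domains. It is an illustration under stated assumptions, not the RL-OSHeDA implementation. The linear maps `Ws` and `Wt`, the squared Euclidean centroid distance, the toy data, and the function names `class_centroids` and `centroid_alignment` are all hypothetical choices made for this example.

```python
import numpy as np

def class_centroids(z, y, num_known):
    """Mean representation per known class; classes with no samples stay NaN."""
    cents = np.full((num_known, z.shape[1]), np.nan)
    for k in range(num_known):
        mask = (y == k)
        if mask.any():
            cents[k] = z[mask].mean(axis=0)
    return cents

def centroid_alignment(zs, ys, zt, yt_pseudo, num_known):
    """Average squared distance between source and target centroids of known classes.

    Target labels are pseudo-labels; any label >= num_known is treated as
    "unknown" and is ignored by the alignment term (an assumption of this sketch).
    """
    cs = class_centroids(zs, ys, num_known)
    ct = class_centroids(zt, yt_pseudo, num_known)
    d = np.sum((cs - ct) ** 2, axis=1)
    valid = ~np.isnan(d)  # skip classes missing in either domain
    return float(d[valid].mean()) if valid.any() else 0.0

# Toy heterogeneous inputs: source features are 5-dimensional, target 3-dimensional.
rng = np.random.default_rng(0)
xs = rng.normal(size=(6, 5))
xt = rng.normal(size=(4, 3))

# Hypothetical domain-specific linear encoders into a shared 2-dimensional space.
Ws = rng.normal(size=(5, 2))
Wt = rng.normal(size=(3, 2))
zs, zt = xs @ Ws, xt @ Wt

ys = np.array([0, 0, 1, 1, 0, 1])   # source labels over 2 known classes
yt_pseudo = np.array([0, 1, 2, 2])  # target pseudo-labels; 2 stands for "unknown"

loss = centroid_alignment(zs, ys, zt, yt_pseudo, num_known=2)
print(loss)
```

In a trained model the encoders would be learned so that this alignment term shrinks alongside the source classification loss and the open-set risk; here they are random, so the printed value only shows that the quantity is well defined for inputs of different dimensionality.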
