Domain Adaptation and Entanglement: an Optimal Transport Perspective
Okan Koç, Alexander Soen, Chao-Kai Chiang, Masashi Sugiyama
TL;DR
The paper tackles robustness to distribution shifts by formulating unsupervised domain adaptation (UDA) bounds through optimal transport. It introduces entanglement, an unoptimizable component capturing how aligning marginals can degrade target accuracy via misaligned conditionals. The authors develop a theoretical OT-based framework, define label- and prediction-entanglement, and derive an Oracle Bound that connects source risk, marginal alignment, and entanglement. Empirically, entanglement explains why some domain-matching methods fail under distribution shifts and how assumptions like Close Conditionals and Gradual Shift can improve transfer, with practical implications for choosing loss functions, models, and optimization strategies in UDA tasks.
Abstract
Current machine learning systems are brittle in the face of distribution shifts (DS), where the target distribution that the system is tested on differs from the source distribution used to train the system. This problem of robustness to DS has been studied extensively in the field of domain adaptation. For deep neural networks, a popular framework for unsupervised domain adaptation (UDA) is domain matching, in which algorithms try to align the marginal distributions in the feature or output space. The current theoretical understanding of these methods, however, is limited, and existing theoretical results are not precise enough to characterize their performance in practice. In this paper, we derive new bounds based on optimal transport that analyze the UDA problem. Our new bounds include a term which we call entanglement, consisting of an expectation of the Wasserstein distance between conditionals with respect to changing data distributions. Analysis of the entanglement term provides a novel perspective on the unoptimizable aspects of UDA. In various experiments with multiple models across several DS scenarios, we show that this term can be used to explain the varying performance of UDA algorithms.
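The point that aligned marginals can hide misaligned conditionals can be illustrated with a toy one-dimensional example. The sketch below is not the paper's definition of entanglement: it is a simplified proxy that assumes labeled samples in both domains, two balanced classes, and equal sample sizes. The helper `w1_1d` and the variables `marginal_gap` and `conditional_gap` are hypothetical names introduced only for this illustration.

```python
import numpy as np


def w1_1d(x, y):
    """Empirical 1-D Wasserstein-1 distance between two equal-size samples.

    For equal-size samples on the real line, the optimal coupling pairs the
    sorted values, so the distance is the mean absolute difference of the
    sorted arrays.
    """
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    if x.shape != y.shape:
        raise ValueError("this sketch assumes equal sample sizes")
    return float(np.mean(np.abs(x - y)))


rng = np.random.default_rng(0)
n = 500  # samples per class (assumption: balanced classes)

# Source domain: class 0 is centered at -1, class 1 at +1.
src_x0 = rng.normal(-1.0, 0.1, n)
src_x1 = rng.normal(+1.0, 0.1, n)

# Target domain: same overall feature marginal, but the classes are swapped.
tgt_x0 = rng.normal(+1.0, 0.1, n)
tgt_x1 = rng.normal(-1.0, 0.1, n)

# Distance between feature marginals, the quantity domain matching reduces.
marginal_gap = w1_1d(
    np.concatenate([src_x0, src_x1]),
    np.concatenate([tgt_x0, tgt_x1]),
)

# Class-averaged distance between class-conditional feature distributions,
# a toy stand-in for the conditional mismatch that marginal alignment
# cannot see.
conditional_gap = 0.5 * (w1_1d(src_x0, tgt_x0) + w1_1d(src_x1, tgt_x1))

print(f"marginal gap:    {marginal_gap:.3f}")
print(f"conditional gap: {conditional_gap:.3f}")
```

In this construction the marginal gap is near zero while the conditional gap is roughly 2, so a classifier trained on the source would be wrong on most target points even though the two domains look perfectly matched. Computing the conditional gap here requires target labels, which are unavailable in UDA; this is one way to see why such a term cannot be directly optimized.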
