Distributionally Robust Learning for Multi-source Unsupervised Domain Adaptation
Zhenyu Wang, Peter Bühlmann, Zijian Guo
TL;DR
This paper addresses distribution shift in multi-source unsupervised domain adaptation by introducing Distributionally Robust Learning (DRoL), which uses labeled data from multiple sources and unlabeled target covariates to build robust predictors under a mixture-based uncertainty set over Y|X. The main theoretical result shows that the population robust predictor is a weighted average of the source conditional means, with aggregation weights obtained by solving a convex quadratic program; a bias-correction step improves the estimation of these weights. The authors provide convergence-rate results, compare reward-based robust modeling with squared-error and regret-based alternatives, and show that the method is computationally tractable and has privacy-friendly, federated-like properties. In simulations and on a real Beijing PM2.5 dataset, DRoL consistently achieves superior worst-case performance, especially when informative prior information about the target mixture is incorporated and bias correction is applied. Overall, DRoL offers a principled, scalable, and privacy-conscious approach to robust prediction under distribution shift across multiple sources, with practical impact for domains where target labels are scarce or unavailable.
Abstract
Empirical risk minimization often performs poorly when the distribution of the target domain differs from those of source domains. To address such potential distribution shifts, we develop an unsupervised domain adaptation approach that leverages labeled data from multiple source domains and unlabeled data from the target domain. We introduce a distributionally robust model that optimizes an adversarial reward based on the explained variance across a class of target distributions, ensuring generalization to the target domain. We show that the proposed robust model is a weighted average of conditional outcome models from source domains. This formulation allows us to compute the robust model through the aggregation of source models, which can be estimated using various machine learning algorithms of the users' choice, such as random forests, boosting, and neural networks. Additionally, we introduce a bias-correction step to obtain a more accurate aggregation weight, which is effective for various machine learning algorithms. Our framework can be interpreted as a distributionally robust federated learning approach that satisfies privacy constraints while providing insights into the importance of each source for prediction on the target domain. The performance of our method is evaluated on both simulated and real data.
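The aggregation step described above can be illustrated with a minimal sketch. It rests on assumptions that are not stated in this summary: that the quadratic program minimizes q'Γq over the probability simplex, and that Γ is the empirical second-moment matrix of the source models' predictions on the unlabeled target covariates. The function names are hypothetical, the solver is a generic projected gradient method rather than the authors' implementation, and the bias-correction step and any prior information on the target mixture are omitted.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {q : q >= 0, sum(q) = 1}."""
    u = sorted(v, reverse=True)
    cumsum, theta = 0.0, 0.0
    for j, uj in enumerate(u, start=1):
        cumsum += uj
        t = (cumsum - 1.0) / j
        if uj - t > 0:          # holds exactly for the first rho sorted entries
            theta = t
    return [max(x - theta, 0.0) for x in v]


def second_moment_matrix(preds):
    """Assumed form of Gamma: Gamma[k][l] = mean_i f_k(x_i) * f_l(x_i).

    preds[k][i] is the prediction of (hypothetical) source model k
    at the i-th unlabeled target covariate vector.
    """
    L, n = len(preds), len(preds[0])
    return [[sum(a * b for a, b in zip(preds[k], preds[l])) / n
             for l in range(L)] for k in range(L)]


def robust_weights(gamma, n_iter=5000):
    """Minimize q' Gamma q over the simplex by projected gradient descent."""
    L = len(gamma)
    q = [1.0 / L] * L
    # The gradient 2*Gamma*q is Lipschitz with constant at most 2*trace(Gamma)
    # when Gamma is positive semidefinite, so this step size is safe.
    step = 1.0 / (2.0 * sum(gamma[i][i] for i in range(L)) + 1e-12)
    for _ in range(n_iter):
        grad = [2.0 * sum(gamma[k][l] * q[l] for l in range(L))
                for k in range(L)]
        q = project_to_simplex([q[k] - step * grad[k] for k in range(L)])
    return q


def aggregate(weights, preds):
    """Weighted average of the source predictions at each target point."""
    n = len(preds[0])
    return [sum(w * p[i] for w, p in zip(weights, preds)) for i in range(n)]


# Toy example: two source models evaluated at four target points.
source_preds = [
    [1.0, 1.0, -1.0, -1.0],   # predictions of source model 1
    [2.0, -2.0, 2.0, -2.0],   # predictions of source model 2
]
gamma = second_moment_matrix(source_preds)   # [[1, 0], [0, 4]]
weights = robust_weights(gamma)              # close to [0.8, 0.2]
robust_preds = aggregate(weights, source_preds)
print(weights, robust_preds)
```

In this toy case the two prediction vectors are orthogonal, so the program reduces to minimizing q1^2 + 4*q2^2 subject to q1 + q2 = 1, which gives weights of 0.8 and 0.2. Because only the source models' predictions on target covariates enter the computation, the sketch also reflects the federated-style reading of the method: no raw source data needs to be pooled.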
