Heterogeneous Multisource Transfer Learning via Model Averaging for Positive-Unlabeled Data
Jialei Liu, Jun Liao, Kuangnan Fang
TL;DR
The paper tackles positive-unlabeled (PU) learning under privacy constraints by leveraging information from multiple heterogeneous source domains. It introduces TLMA-PU, a framework that fits domain-specific logistic regression models for fully labeled, semi-supervised, and PU sources and then transfers knowledge through a weighted average of parameters, with weights optimized by cross-validated KL-divergence criteria. The authors establish asymptotic weight optimality under misspecification, and weight convergence when the target model is correctly specified, including extensions to high-dimensional sparse settings. Empirical results from simulations and a real credit-risk dataset show that TLMA-PU improves predictive accuracy and robustness, especially with limited labeled target data and diverse data sources. The approach preserves privacy by sharing only parameter vectors and offers a principled, theoretically grounded solution for cross-domain PU learning in practical risk-control tasks.
Abstract
Positive-Unlabeled (PU) learning presents unique challenges because explicitly labeled negative samples are unavailable, particularly in high-stakes domains such as fraud detection and medical diagnosis. To address data scarcity and privacy constraints, we propose a novel transfer learning framework based on model averaging that integrates information from heterogeneous data sources (fully binary labeled, semi-supervised, and PU data sets) without direct data sharing. For each source domain type, a tailored logistic regression model is fitted, and knowledge is transferred to the PU target domain through model averaging. Optimal weights for combining source models are determined via a cross-validation criterion that minimizes the Kullback-Leibler divergence. We establish theoretical guarantees for weight optimality and convergence, covering both misspecified and correctly specified target models, with further extensions to high-dimensional settings using sparsity-penalized estimators. Extensive simulations and real-world credit risk data analyses demonstrate that our method outperforms competing methods in terms of predictive accuracy and robustness, especially under limited labeled data and heterogeneous environments.
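The weighting idea described above can be illustrated with a minimal sketch. It is not the authors' TLMA-PU implementation, and it rests on several simplifying assumptions: the source parameter vectors are given directly (standing in for models fitted privately at each source), the validation set is fully labeled rather than PU, a single hold-out set replaces cross-validation, and the weights are found by exponentiated-gradient descent on the probability simplex, which is a substitute optimizer chosen here for brevity. All function names (`average_params`, `neg_log_lik`, `fit_weights`) are hypothetical. The average negative log-likelihood is used as the criterion because it equals the empirical KL divergence up to a constant that does not depend on the weights.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict(beta, x):
    # Logistic-regression probability P(y = 1 | x) for coefficients beta.
    return sigmoid(sum(b * xi for b, xi in zip(beta, x)))

def average_params(betas, weights):
    # Weighted average of the source parameter vectors.
    dim = len(betas[0])
    return [sum(w * b[j] for w, b in zip(weights, betas)) for j in range(dim)]

def neg_log_lik(beta, X, y):
    # Average Bernoulli negative log-likelihood on (X, y).
    eps = 1e-12
    total = 0.0
    for x, yi in zip(X, y):
        p = min(max(predict(beta, x), eps), 1.0 - eps)
        total -= yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return total / len(X)

def fit_weights(betas, X, y, steps=200, lr=0.5):
    # Exponentiated-gradient descent keeps the weights nonnegative
    # and summing to one (an assumed optimizer, not the paper's).
    k = len(betas)
    w = [1.0 / k] * k
    n = len(X)
    for _ in range(steps):
        beta = average_params(betas, w)
        # Gradient of the loss with respect to the averaged coefficients.
        g_beta = [0.0] * len(beta)
        for x, yi in zip(X, y):
            r = predict(beta, x) - yi
            for j, xj in enumerate(x):
                g_beta[j] += r * xj / n
        # Chain rule: d loss / d w_k = <g_beta, beta_k>.
        g_w = [sum(g * b for g, b in zip(g_beta, bk)) for bk in betas]
        w = [wi * math.exp(-lr * gi) for wi, gi in zip(w, g_w)]
        s = sum(w)
        w = [wi / s for wi in w]
    return w

# Synthetic demo: one source close to the target, one pointing the opposite way.
random.seed(0)
beta_true = [1.0, -2.0]            # hypothetical target coefficients
betas = [[0.9, -1.8], [-1.0, 2.0]]  # hypothetical source coefficients
X_val = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(200)]
y_val = [1 if random.random() < predict(beta_true, x) else 0 for x in X_val]

weights = fit_weights(betas, X_val, y_val)
print("weights:", [round(w, 3) for w in weights])
```

In this toy setting only the coefficient vectors cross the source boundary, which mirrors the privacy argument, and the informative source should receive most of the weight. A PU target would require replacing `neg_log_lik` with a likelihood suited to positive-unlabeled data, which this sketch does not attempt.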
