Heterogeneous Multisource Transfer Learning via Model Averaging for Positive-Unlabeled Data

Jialei Liu, Jun Liao, Kuangnan Fang

TL;DR

The paper tackles positive-unlabeled (PU) learning under privacy constraints by leveraging information from multiple heterogeneous source domains. It introduces TLMA-PU, a framework that fits domain-specific logistic regression models for fully labeled, semi-supervised, and PU sources and then transfers knowledge through a weighted average of parameters, with weights optimized by cross-validated KL-divergence criteria. The authors establish asymptotic weight optimality under misspecification, and weight convergence when the target model is correctly specified, including extensions to high-dimensional sparse settings. Empirical results from simulations and a real credit-risk dataset show that TLMA-PU improves predictive accuracy and robustness, especially with limited labeled target data and diverse data sources. The approach preserves privacy by sharing only parameter vectors and offers a principled, theoretically grounded solution for cross-domain PU learning in practical risk-control tasks.

Abstract

Positive-Unlabeled (PU) learning presents unique challenges due to the lack of explicitly labeled negative samples, particularly in high-stakes domains such as fraud detection and medical diagnosis. To address data scarcity and privacy constraints, we propose a novel transfer learning framework based on model averaging that integrates information from heterogeneous data sources (fully binary labeled, semi-supervised, and PU data sets) without direct data sharing. For each source domain type, a tailored logistic regression model is fitted, and knowledge is transferred to the PU target domain through model averaging. Optimal weights for combining source models are determined via a cross-validation criterion that minimizes the Kullback-Leibler divergence. We establish theoretical guarantees for weight optimality and convergence, covering both misspecified and correctly specified target models, with further extensions to high-dimensional settings using sparsity-penalized estimators. Extensive simulations and real-world credit risk data analyses demonstrate that our method outperforms competing methods in predictive accuracy and robustness, especially under limited labeled data and heterogeneous environments.
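
The weighting idea in the abstract can be illustrated with a small, simplified sketch. It is not the authors' TLMA-PU procedure: it assumes a fully labeled held-out set in place of a PU target, uses a single split instead of cross-validation, and picks the simplex weights with an exponentiated-gradient update chosen here for convenience. All names (`fit_weights`, `source_thetas`, and so on) are hypothetical. The mean negative log-likelihood is used as the criterion because it equals the KL divergence up to a constant that does not depend on the weights.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def average_params(thetas, w):
    # Weighted average of the source parameter vectors.
    d = len(thetas[0])
    return [sum(wk * th[j] for wk, th in zip(w, thetas)) for j in range(d)]

def neg_loglik(theta, X, y):
    # Mean Bernoulli negative log-likelihood of a logistic model.
    total = 0.0
    for x, yi in zip(X, y):
        p = sigmoid(sum(t * xj for t, xj in zip(theta, x)))
        p = min(max(p, 1e-12), 1.0 - 1e-12)
        total -= yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return total / len(y)

def fit_weights(thetas, X, y, steps=500, lr=0.5):
    # Assumed optimizer: exponentiated gradient, which keeps the
    # weights non-negative and summing to one at every step.
    K, d, n = len(thetas), len(thetas[0]), len(y)
    w = [1.0 / K] * K
    for _ in range(steps):
        theta = average_params(thetas, w)
        # Gradient of the loss with respect to theta: mean of (p - y) * x.
        g_theta = [0.0] * d
        for x, yi in zip(X, y):
            p = sigmoid(sum(t * xj for t, xj in zip(theta, x)))
            for j in range(d):
                g_theta[j] += (p - yi) * x[j] / n
        # Chain rule: d(loss)/d(w_k) is the inner product of g_theta and theta_k.
        g_w = [sum(g * t for g, t in zip(g_theta, th)) for th in thetas]
        w = [wk * math.exp(-lr * gk) for wk, gk in zip(w, g_w)]
        s = sum(w)
        w = [wk / s for wk in w]
    return w

# Synthetic held-out data (illustrative only, not from the paper).
rng = random.Random(0)
true_theta = [1.5, -1.0]
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]
y = [1 if rng.random() < sigmoid(sum(t * v for t, v in zip(true_theta, x))) else 0
     for x in X]

# Two hypothetical source estimates: one close to the truth, one misleading.
source_thetas = [[1.4, -0.9], [-1.5, 1.0]]
w_hat = fit_weights(source_thetas, X, y)
print(w_hat)
```

Only the parameter vectors in `source_thetas` cross domain boundaries in this sketch, which mirrors the privacy argument above; the held-out data never leaves the target side.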

Paper Structure

This paper contains 12 sections, 6 theorems, 20 equations, 5 figures.

Key Result

Theorem 3.1

Suppose Conditions c:unique-c: xi and n hold. Then $\hat{\boldsymbol{w}}$ is asymptotically optimal in the sense that

Figures (5)

  • Figure 1: Comparison of the three data labeling paradigms.
  • Figure 2: The process of transfer learning on PU data.
  • Figure 3: $\widehat{\text{KL}}(\hat{\boldsymbol{w}})/\inf_{\boldsymbol{w}\in\mathbb{W}}\widehat{\text{KL}}(\boldsymbol{w})$ of Case 1 under different training sample sizes.
  • Figure 4: The sum of weights assigned to uninformative models in Case 2.
  • Figure 5: The AUC_adj of compared methods.

Theorems & Definitions (10)

  • Remark 2.1
  • Remark 2.2
  • Remark 2.3
  • Theorem 3.1
  • Remark 3.1
  • Theorem 3.2
  • Theorem 3.3
  • Theorem 4.1
  • Theorem 4.2
  • Theorem 4.3