Fused Multinomial Logistic Regression Utilizing Summary-Level External Machine-learning Information

Chi-Shian Dai, Jun Shao

Abstract

In many modern applications, a carefully designed primary study provides individual-level data for interpretable modeling, while summary-level external information is available through black-box, efficient, and nonparametric machine-learning predictions. Although summary-level external information has been studied in the data integration literature, there is limited methodology for leveraging external nonparametric machine-learning predictions to improve statistical inference in the primary study. We propose a general empirical-likelihood framework that incorporates external predictions through moment constraints. An advantage of nonparametric machine-learning prediction is that it induces a rich class of valid moment restrictions that remain robust to covariate shift under a mild overlap condition without requiring explicit density-ratio modeling. We focus on multinomial logistic regression as the primary model and address common data-quality issues in external sources, including coarsened outcomes, partially observed covariates, covariate shift, and heterogeneity in generating mechanisms known as concept shift. We establish large-sample properties of the resulting fused estimator, including consistency and asymptotic normality under regularity conditions. Moreover, we provide mild sufficient conditions under which incorporating external predictions delivers a strict efficiency gain relative to the primary-only estimator. Simulation studies and an application to the National Health and Nutrition Examination Survey on multiclass blood-pressure classification illustrate the proposed method.
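The core mechanism described above, reweighting the primary sample by empirical likelihood so that a summary-level external moment holds exactly, can be sketched numerically. The following is a minimal illustration, not the paper's full fused estimator: it uses a single hypothetical moment constraint (the external source reports the population mean `mu_ext` of an ML prediction `mhat(X)`), and solves the standard profile-empirical-likelihood dual for the Lagrange multiplier by Newton's method. All variable names and data are illustrative assumptions.

```python
import numpy as np

# Hypothetical setup: mhat holds external ML predictions evaluated on the
# primary sample's covariates; mu_ext is the externally reported population
# mean of those predictions. Both are simulated here for illustration.
rng = np.random.default_rng(0)
n = 500
mhat = rng.normal(0.7, 1.0, size=n)   # predictions mhat(X_i) on primary units
mu_ext = 0.5                          # summary-level external information
g = mhat - mu_ext                     # moment function g_i with E[g] = 0

# Profile empirical likelihood: weights p_i = 1 / (n * (1 + lam * g_i)),
# where lam maximizes the concave dual f(lam) = sum_i log(1 + lam * g_i),
# i.e. solves sum_i g_i / (1 + lam * g_i) = 0.
lam = 0.0
for _ in range(25):                   # Newton iterations on the concave dual
    t = 1.0 + lam * g
    grad = np.sum(g / t)              # f'(lam)
    hess = -np.sum(g**2 / t**2)       # f''(lam) < 0
    lam -= grad / hess

p = 1.0 / (n * (1.0 + lam * g))       # implied empirical-likelihood weights

print(p.sum())          # weights sum to 1 at the dual optimum
print(np.sum(p * g))    # external moment constraint holds (≈ 0)
```

In the paper's setting the moment function would instead be built from the external predictions and the multinomial logistic regression score, and the EL weights would be profiled jointly with the regression parameters; this sketch only isolates the constraint-enforcement step.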

Paper Structure

This paper contains 26 sections, 4 theorems, 50 equations, 2 figures, 3 tables.

Key Result

Theorem 1

Under Assumptions 1(i), 2-3 and the regularity conditions (C1)-(C3) stated in the Appendix, any maximizer $\widehat{{\boldsymbol\gamma}}$ of $\ell_n({\boldsymbol\gamma} \mid \widehat{{\boldsymbol q}} \, )$ is consistent; that is, $\widehat{{\boldsymbol\gamma}} \xrightarrow{p} {\boldsymbol\gamma}_0$ as $n \to \infty$, where ${\boldsymbol\gamma}_0$ denotes the true parameter value.

Figures (2)

  • Figure 1: Standardized mean differences for shared covariates between the primary and external sources.
  • Figure 2: Point estimates (dots) and confidence intervals (vertical bars).

Theorems & Definitions (4)

  • Theorem 1: Consistency
  • Theorem 2: Asymptotic normality
  • Theorem 3
  • Theorem 4