Efficient semi-supervised inference for logistic regression under case-control studies

Zhuojun Quan; Yuanyuan Lin; Kani Chen; Wen Yu

TL;DR

This work addresses logistic regression under case-control sampling when unlabeled covariates are also available. The intercept $\alpha$ is not identifiable from case-control data alone, but unlabeled data identify $\alpha$ and enable efficient inference for the slope $\beta$. The authors derive a joint nonparametric likelihood for the labeled and unlabeled data and compute the MLE $\hat{\theta}=(\hat{\alpha},\hat{\beta})$ with an iterative algorithm that discretizes the covariate distribution and alternates updates of $\theta$ and the discretized masses $\mathbf{p}$. The marginal case proportion $\mathsf{P}(Y=1)$ is then estimated by $\hat{P}=\sum_{i=1}^N \phi(x_i;\hat{\theta})\hat{p}_i$. Large-sample theory shows that the estimator is consistent and asymptotically normal, with the slope achieving the semiparametric efficiency bound. Simulations and a Pima Indians diabetes data example show that incorporating unlabeled data identifies the intercept, tightens standard errors for $\beta$, and improves predictive performance relative to using labeled data alone. The approach provides a principled, efficient way to leverage unlabeled covariates in case-control studies, with practical relevance to biomedical and epidemiological settings.
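The plug-in formula $\hat{P}=\sum_{i=1}^N \phi(x_i;\hat{\theta})\hat{p}_i$ can be illustrated with a minimal Python sketch. It assumes that $\phi(x;\theta)$ is the logistic function of $\alpha+\beta^\top x$, which the summary does not spell out. The function names `logistic_prob` and `marginal_case_proportion` and the example support points and masses are hypothetical; in the paper the masses $\hat{p}_i$ come from the iterative algorithm, which is not reproduced here.

```python
import math
from typing import Sequence


def logistic_prob(x: Sequence[float], alpha: float, beta: Sequence[float]) -> float:
    """Assumed form of phi(x; theta): logistic function of alpha + beta'x."""
    if len(x) != len(beta):
        raise ValueError("x and beta must have the same length")
    z = alpha + sum(b * v for b, v in zip(beta, x))
    # Numerically stable evaluation of exp(z) / (1 + exp(z)).
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def marginal_case_proportion(
    xs: Sequence[Sequence[float]],
    masses: Sequence[float],
    alpha: float,
    beta: Sequence[float],
) -> float:
    """Plug-in estimate: sum_i phi(x_i; theta) * p_i over the support points."""
    if len(xs) != len(masses):
        raise ValueError("need one mass per support point")
    if any(p < 0 for p in masses) or abs(math.fsum(masses) - 1.0) > 1e-8:
        raise ValueError("masses must be nonnegative and sum to 1")
    return math.fsum(logistic_prob(x, alpha, beta) * p for x, p in zip(xs, masses))


if __name__ == "__main__":
    # Hypothetical support points and masses, for illustration only.
    xs = [[-1.0], [0.0], [2.0]]
    masses = [0.25, 0.5, 0.25]
    print(marginal_case_proportion(xs, masses, alpha=-1.0, beta=[0.8]))
```

The sketch only evaluates the final weighted sum. It takes $\hat{\theta}$ and $\hat{\mathbf{p}}$ as given and makes no attempt at the alternating updates described above.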

Abstract

Semi-supervised learning has received increasing attention in statistics and machine learning. In semi-supervised settings, a labeled data set with both outcomes and covariates and an unlabeled data set with covariates only are collected. We consider an inference problem in semi-supervised settings where the outcome in the labeled data is binary and the labeled data are collected by case-control sampling. Case-control sampling is an effective sampling scheme for alleviating class imbalance in binary data. Under the logistic model assumption, case-control data can still provide a consistent estimator for the slope parameter of the regression model. However, the intercept parameter is not identifiable. Consequently, the marginal case proportion cannot be estimated from case-control data. We find that, with the availability of the unlabeled data, the intercept parameter can be identified in the semi-supervised setting. We construct the likelihood function of the observed labeled and unlabeled data and obtain the maximum likelihood estimator via an iterative algorithm. The proposed estimator is shown to be consistent, asymptotically normal, and semiparametrically efficient. Extensive simulation studies are conducted to assess the finite-sample performance of the proposed method. The results imply that the unlabeled data not only help to identify the intercept but also improve the estimation efficiency of the slope parameter. Meanwhile, the marginal case proportion can be estimated accurately by the proposed method.
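The identifiability statements can be made concrete with a short Bayes-rule calculation. This is a sketch under stated assumptions rather than the paper's own derivation: it assumes the logistic model $\mathsf{P}(Y=1\mid X=x)=\phi(x;\theta)=\exp(\alpha+\beta^\top x)/\{1+\exp(\alpha+\beta^\top x)\}$, and the symbols $\pi=\mathsf{P}(Y=1)$, $f$ (marginal covariate density), $f_1$, $f_0$ (covariate densities among cases and controls), and $\alpha^*$ are introduced here for illustration only.

$$
\frac{f_1(x)}{f_0(x)}
=\frac{\phi(x;\theta)\,f(x)/\pi}{\{1-\phi(x;\theta)\}\,f(x)/(1-\pi)}
=\frac{1-\pi}{\pi}\exp(\alpha+\beta^\top x)
=\exp(\alpha^*+\beta^\top x),
\qquad
\alpha^*=\alpha+\log\frac{1-\pi}{\pi}.
$$

Case-control data are samples from $f_1$ and $f_0$, so they carry information about $\beta$ and the combined quantity $\alpha^*$, but not about $\alpha$ and $\pi$ separately. Unlabeled covariates are samples from $f$, which adds the constraint

$$
\pi=\int \phi(x;\theta)\,f(x)\,dx ,
$$

linking $\pi$ to $\alpha$ for a given $\beta$. This is consistent with the claim that the intercept and the marginal case proportion become estimable once unlabeled data are available.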

Paper Structure

This paper contains 12 sections, 2 theorems, 47 equations, 4 tables, 1 algorithm.

Introduction
Notation and data structure
Main results
Identifiability issue
Maximum likelihood estimation
The algorithm
Large-sample properties
Numerical results
Simulation studies
Pima Indians diabetes data
Concluding remarks
Appendix

Key Result

Theorem 1

(Consistency) Suppose that conditions C1-C3 hold. If $n/N\to c_1$ for some constant $c_1\in(0,1)$ and $n_1/n_0\to c_2$ for some constant $c_2\in(0,\infty)$ as $N\to\infty$, then $|\hat{\theta}-\theta_0|\to0$ and $\sup_{x\in\mathbb{R}^p}|\hat{F}(x)-F_0(x)|\to0$ almost surely.

Theorems & Definitions (2)

Theorem 1
Theorem 2
