Adaptive debiased SGD in high-dimensional GLMs with streaming data

Ruijian Han; Lan Luo; Yuanhang Luo; Yuanyuan Lin; Jian Huang

TL;DR

This work addresses online inference in high-dimensional generalized linear models with streaming data by developing the Adaptive Debiased Lasso (ADL). ADL combines an adaptive RADAR online lasso for coefficient estimation, online nodewise lasso for debiasing, and Taylor-based approximations to create a one-pass method that maintains only $O(p)$ space while enabling asymptotically valid confidence intervals. Theoretical results establish oracle inequalities for the online estimators and asymptotic normality of the ADL debiased estimator under realistic online-design conditions. Empirical studies, including simulations and a spam classification application, demonstrate that ADL achieves competitive statistical accuracy with substantial computational efficiency, enabling real-time inference in ultra-high dimensional streaming settings.

Abstract

Online statistical inference facilitates real-time analysis of sequentially collected data, which distinguishes it from traditional methods that rely on static datasets. This paper introduces a novel approach to online inference in high-dimensional generalized linear models, in which regression coefficient estimates and their standard errors are updated upon each new data arrival. In contrast to existing methods that require either full dataset access or storage of large-dimensional summary statistics, our method operates in a single-pass mode, significantly reducing both time and space complexity. The core of our methodological innovation lies in an adaptive stochastic gradient descent algorithm tailored for dynamic objective functions, coupled with a novel online debiasing procedure. This allows us to maintain low-dimensional summary statistics while effectively controlling the optimization error introduced by the dynamically changing loss functions. We establish the asymptotic normality of our proposed Adaptive Debiased Lasso (ADL) estimator. We conduct extensive simulation experiments to show the statistical validity and computational efficiency of our ADL estimator across various settings. Its computational efficiency is further demonstrated via a real data application to spam email classification.
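To make the single-pass, low-memory idea concrete, the following is a minimal sketch of a generic one-pass proximal SGD update for an l1-penalized logistic regression. It is not the authors' adaptive RADAR algorithm and it performs no debiasing; it only illustrates how each observation can be used once and discarded while the state stays at a single length-p vector. The function names, step-size schedule, penalty level, and simulated data are all assumptions made for illustration.

```python
import numpy as np

def soft_threshold(v, t):
    """Elementwise soft-thresholding: the proximal map of t * ||v||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def one_pass_prox_sgd(stream, p, step0=0.5, lam=0.01):
    """Hypothetical one-pass proximal SGD for l1-penalized logistic regression.

    Each (x, y) pair is seen exactly once; the only state kept between
    observations is the length-p coefficient vector, i.e. O(p) memory.
    """
    beta = np.zeros(p)
    for i, (x, y) in enumerate(stream, start=1):
        step = step0 / np.sqrt(i)                 # assumed decaying step size
        z = np.clip(x @ beta, -30.0, 30.0)        # guard against overflow in exp
        mu = 1.0 / (1.0 + np.exp(-z))             # logistic mean function
        grad = (mu - y) * x                       # gradient of the per-sample loss
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta

def simulate_stream(n, beta_true, rng):
    """Yield n simulated logistic observations one at a time."""
    for _ in range(n):
        x = rng.standard_normal(beta_true.shape[0])
        prob = 1.0 / (1.0 + np.exp(-(x @ beta_true)))
        y = float(rng.random() < prob)
        yield x, y

rng = np.random.default_rng(0)
p = 50
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -2.0, 1.5]                  # sparse signal, chosen arbitrarily
beta_hat = one_pass_prox_sgd(simulate_stream(5000, beta_true, rng), p)
```

Because the data arrive through a generator, nothing beyond the current observation and `beta` is ever held in memory. A debiased estimator with confidence intervals, as described in the abstract, would need additional machinery that this sketch omits.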

Paper Structure (12 sections, 4 theorems, 24 equations, 4 figures, 3 tables, 1 algorithm)

Introduction
Methodology
Online lasso with RADAR
Online nodewise lasso with adaptive RADAR
Approximated debiasing techniques
Variance and confidence interval
Theoretical properties
Oracle inequalities
Asymptotic normality
Simulation studies
Real data example: spam email classification
Conclusion

Key Result

Theorem 3.1

Suppose that Assumptions 1--3 hold. With the hyper-parameters chosen according to (S.1)--(S.3) in the supplementary material, the following events for $i \geq n_1$ hold uniformly for a universal constant $C_1$ with probability at least $1 - 7(\log p)^{-6}$.

Figures (4)

Figure 1: The whole procedure in conducting online statistical inference.
Figure 2: The timeline of three key steps in constructing the ADL estimator.
Figure 3: Simulation results averaged over 500 replications (each takes around 42.452s) with $n=1000$, $p=20000$, $s_0=20$, $\boldsymbol{\Sigma}= \{0.5^{|i-j|} \}_{i,j=1,\dots,p}$.
Figure 4: 99% confidence interval estimate for "investment", "schedule", and "per cent" (upper left, upper right, and bottom left), where lines and shaded areas represent the traces of the ADL estimator and 99% confidence interval respectively. The prediction error evaluated on the test set, as training sample size increases, is shown in the bottom right.

Theorems & Definitions (6)

Remark 1
Theorem 3.1
Remark 2
Theorem 3.2
Theorem 3.3
Corollary 3.4
