Asymptotics of Stochastic Gradient Descent with Dropout Regularization in Linear Models

Jiaqi Li; Johannes Schmidt-Hieber; Wei Biao Wu

TL;DR

This work develops an online inference framework for stochastic gradient descent with dropout in linear models by establishing geometric-moment contraction and quenched central limit theorems for both GD and Ruppert-Polyak averaged SGD with dropout. It shows the existence of a unique stationary distribution $\pi_{\alpha}$ under a learning-rate condition and derives explicit long-run covariance structures, enabling online estimation. An online long-run covariance estimator based on non-overlapping batched means is proposed, with rigorous guarantees and asymptotic coverage for both joint and one-dimensional projections. The theory is supported by simulations that confirm contraction properties, accurate long-run covariance estimation, and valid confidence intervals in high-dimensional, online settings. Overall, the paper provides a rigorous asymptotic framework and practical tools for uncertainty quantification in SGD with dropout in streaming data scenarios.

Abstract

This paper proposes an asymptotic theory for online inference of the stochastic gradient descent (SGD) iterates with dropout regularization in linear regression. Specifically, we establish the geometric-moment contraction (GMC) for constant step-size SGD dropout iterates to show the existence of a unique stationary distribution of the dropout recursive function. By the GMC property, we provide quenched central limit theorems (CLT) for the difference between dropout and $\ell^2$-regularized iterates, regardless of initialization. The CLT for the difference between the Ruppert-Polyak averaged SGD (ASGD) with dropout and $\ell^2$-regularized iterates is also presented. Based on these asymptotic normality results, we further introduce an online estimator for the long-run covariance matrix of ASGD dropout to facilitate inference in a recursive manner with efficiency in computational time and memory. The numerical experiments demonstrate that for sufficiently large samples, the proposed confidence intervals for ASGD with dropout nearly achieve the nominal coverage probability.
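The constant step-size recursion described in the abstract can be illustrated with a small simulation. The sketch below is a minimal illustration, not the authors' code. It assumes the update $\boldsymbol\beta_k = \boldsymbol\beta_{k-1} + \alpha D_k \boldsymbol x_k (y_k - \boldsymbol x_k^{\top} D_k \boldsymbol\beta_{k-1})$ with a diagonal Bernoulli($p$) dropout matrix $D_k$, and it assumes the $\ell^2$-regularized target has the closed form $(p\Sigma + (1-p)\mathrm{Diag}(\Sigma))^{-1}\Sigma\boldsymbol\beta^*$ when $\mathbb E[\boldsymbol x\boldsymbol x^{\top}]=\Sigma$. The function names `l2_target`, `sgd_dropout`, and `simulate` are hypothetical, as is the Gaussian design with unit noise variance.

```python
import numpy as np


def l2_target(sigma, beta_star, p):
    """Assumed minimizer of E[(y - x^T D beta)^2] for y = x^T beta_star + noise.

    Closed form used here (an assumption of this sketch):
    (p * sigma + (1 - p) * Diag(sigma))^{-1} sigma beta_star.
    """
    reg = p * sigma + (1.0 - p) * np.diag(np.diag(sigma))
    return np.linalg.solve(reg, sigma @ beta_star)


def sgd_dropout(X, y, alpha=0.01, p=0.9, seed=0):
    """One pass of constant step-size SGD with dropout, plus a running average.

    Returns the last iterate and the Ruppert-Polyak average of all iterates.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    beta = np.zeros(d)       # initialization at zero
    beta_bar = np.zeros(d)   # running average of the iterates
    for k in range(n):
        mask = rng.random(d) < p             # keep each coordinate with prob. p
        x_drop = X[k] * mask                 # D_k x_k
        beta = beta + alpha * x_drop * (y[k] - x_drop @ beta)
        beta_bar += (beta - beta_bar) / (k + 1)
    return beta, beta_bar


def simulate(n=20000, d=10, seed=1):
    """Hypothetical data: standard Gaussian design, unit-variance noise."""
    rng = np.random.default_rng(seed)
    beta_star = np.linspace(0.0, 1.0, d)     # equidistant between 0 and 1
    X = rng.standard_normal((n, d))
    y = X @ beta_star + rng.standard_normal(n)
    return X, y, beta_star
```

With an identity design covariance the assumed target coincides with $\boldsymbol\beta^*$, so `sgd_dropout(*simulate()[:2])` should return an averaged iterate close to `np.linspace(0, 1, 10)`. For a non-identity covariance, compare against `l2_target` instead.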

Paper Structure (41 sections, 27 theorems, 280 equations, 5 figures, 4 tables, 1 algorithm)

Introduction
Background
Notation
Dropout Regularization
Asymptotic Properties of Dropout in GD
Geometric-Moment Contraction (GMC)
Iterative Dropout Schemes
Dropout with Ruppert-Polyak Averaging
Generalization to Stochastic Gradient Descent
Dropout Regularization in SGD
GMC of Dropout in SGD
Asymptotics of Dropout in SGD
Online Inference for SGD with Dropout
Simulation Studies
Sharp Range of the Learning Rate
...and 26 more sections

Key Result

Lemma 1

If $q>1$ and $\alpha\|\mathbb X\|<2$, then $r_{\alpha,q}<1$.

Figures (5)

Figure 1: Convergence traces of AGD and ASGD iterates with dropout regularization based on a single run, with dimension $d=10$ and initialization at zero. The coordinates of the true parameter $\boldsymbol \beta^*$ are equidistantly spaced between 0 and 1, the learning rate $\alpha=0.01$, and the retaining probability $p=0.9$. Each curve represents the convergence trace of one coordinate.
Figure 2: Estimated long-run variances of ASGD dropout iterates, i.e., diagonals of the estimated long-run covariance matrix $\hat{\Sigma}_n(\alpha)$ for the same setting as in Figure 1.
Figure 3: Length of the joint CI for the one-dimensional projection $\boldsymbol v^{\top}\boldsymbol \beta^*$ of the ASGD dropout iterates for the same setting as in Figure 1.
Figure 4: Coverage probabilities of 95% CI for ASGD dropout iterates averaged over $d$ coordinates from 200 independent runs. Red dashed line denotes the nominal coverage rate of 0.95. Dimension $d=10$, $p=0.9$, $\alpha=0.01$, and coordinates of $\boldsymbol \beta^*$ are equidistantly spaced between 0 and 1 with initializations at zero.
Figure 5: Coverage probabilities of 95% joint confidence intervals for one-dimensional projection of ASGD dropout iterates from 200 independent runs. Red dashed line denotes the nominal coverage rate of 0.95. Dimension $d=50$, $p=0.9$, $\alpha=0.01$, and coordinates of $\boldsymbol \beta^*$ are equidistantly spaced between 0 and 1 with initializations at zero.
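
The long-run covariance estimates and confidence intervals in Figures 2 to 5 rest on non-overlapping batched means. The sketch below is a simplified offline version with one fixed batch size, given only to illustrate the idea. It is not the paper's recursive online estimator, and the names `batch_means_cov` and `projection_ci` and the normal quantile default are assumptions of this sketch.

```python
import numpy as np


def batch_means_cov(samples, batch_size):
    """Non-overlapping batch-means estimate of the long-run covariance.

    samples: array of shape (n,) or (n, d), e.g. a path of SGD iterates.
    Trailing observations that do not fill a complete batch are discarded.
    """
    samples = np.asarray(samples, dtype=float)
    if samples.ndim == 1:
        samples = samples[:, None]
    n, d = samples.shape
    m = n // batch_size                      # number of complete batches
    if m < 2:
        raise ValueError("need at least two full batches")
    used = samples[: m * batch_size]
    means = used.reshape(m, batch_size, d).mean(axis=1)   # one mean per batch
    centered = means - used.mean(axis=0)
    # Scale by the batch size: Var(batch mean) is roughly Sigma / batch_size.
    return batch_size * centered.T @ centered / (m - 1)


def projection_ci(estimate, cov, v, n, z=1.959964):
    """Normal-approximation CI for v^T theta, given a long-run covariance."""
    half = z * np.sqrt(v @ cov @ v / n)
    center = v @ estimate
    return center - half, center + half
```

For a path `path` of shape `(n, d)`, a call such as `projection_ci(path.mean(axis=0), batch_means_cov(path, 200), v, n)` gives an approximate 95% interval for the projection along `v`. The batch size of 200 is an arbitrary illustrative choice; a fixed batch size trades bias against variance, which is one reason online schemes let batches grow with the sample size.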

Theorems & Definitions (53)

Definition 1: Geometric-moment contraction
Lemma 1: Learning-rate range in GD dropout
Theorem 1: Geometric-moment contraction of GD dropout
Lemma 2: Affine approximation
Lemma 3: Moment convergence of iterative GD dropout
Theorem 2: Quenched CLT of iterative GD dropout
Theorem 3: Quenched CLT of averaged GD dropout
Corollary 1: Quenched CLT of parallel averaged GD dropout
Theorem 4: Quenched invariance principle of averaged GD dropout
Lemma 4: Learning-rate range in SGD dropout
...and 43 more
