Demystifying SGD with Doubly Stochastic Gradients

Kyurae Kim; Joohwan Ko; Yi-An Ma; Jacob R. Gardner

TL;DR

This work analyzes SGD on finite-sum objectives in which each component $f_i$ is an intractable expectation, so that the component gradients are estimated by Monte Carlo and the sum is estimated by subsampling (doubly SGD). The authors derive a general variance bound that separates the contributions of component variance, cross-component correlations, and subsampling noise, and show convergence under the expected residual (ER) and bounded variance (BV) conditions even with dependent estimators. They also extend the results to random reshuffling (RR), demonstrating improved iteration complexity in strongly convex settings. A key practical insight is that, under a fixed budget $m \times b$, increasing the minibatch size $b$ often reduces variance more effectively than increasing the number of Monte Carlo samples $m$, particularly when estimators are correlated, and RR can yield substantial gains in convergence speed. The simulations corroborate the theory, and the results offer guidance for allocating computation between minibatch size and Monte Carlo sampling in real-world finite-sum-with-infinite-data problems such as diffusion models and variational inference.

Abstract

Optimization objectives in the form of a sum of intractable expectations are rising in importance (e.g., diffusion models, variational autoencoders, and many more), a setting also known as "finite sum with infinite data." For these problems, a popular strategy is to employ SGD with doubly stochastic gradients (doubly SGD): the expectations are estimated using the gradient estimator of each component, while the sum is estimated by subsampling over these estimators. Despite its popularity, little is known about the convergence properties of doubly SGD, except under strong assumptions such as bounded variance. In this work, we establish the convergence of doubly SGD with independent minibatching and random reshuffling under general conditions, which encompass dependent component gradient estimators. In particular, for dependent estimators, our analysis allows fine-grained analysis of the effect of correlations. As a result, under a per-iteration computational budget of $b \times m$, where $b$ is the minibatch size and $m$ is the number of Monte Carlo samples, our analysis suggests where one should invest most of the budget in general. Furthermore, we prove that random reshuffling (RR) improves the complexity dependence on the subsampling noise.
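To make the two sources of randomness concrete, here is a minimal sketch of doubly SGD on a toy problem, covering both independent minibatching and random reshuffling. The toy components $f_i(\vx) = \mathbb{E}_{\eta}\left[\tfrac{1}{2}\lVert \vx - c_i - \eta \rVert^2\right]$, the names `component_grad` and `doubly_sgd`, the $1/t$ step size, and all parameter values are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def component_grad(x, i, m, centers, sigma, rng):
    """m-sample Monte Carlo estimate of grad f_i(x) for the toy
    f_i(x) = E_eta[0.5 * ||x - c_i - eta||^2], eta ~ N(0, sigma^2 I)."""
    eta = rng.normal(0.0, sigma, size=(m, x.size))
    # Gradient of the integrand w.r.t. x is (x - c_i - eta); average over m draws.
    return np.mean(x - centers[i] - eta, axis=0)

def doubly_sgd(centers, b, m, epochs, sigma=0.5, reshuffle=False, seed=0):
    """Doubly SGD: subsample b components, use m Monte Carlo samples for each.
    Assumes b divides n so every epoch has n // b steps."""
    rng = np.random.default_rng(seed)
    n, d = centers.shape
    x = np.zeros(d)
    t = 0
    for _ in range(epochs):
        if reshuffle:
            # Random reshuffling: one permutation per epoch, cut into minibatches.
            perm = rng.permutation(n)
            batches = [perm[k:k + b] for k in range(0, n, b)]
        else:
            # Independent minibatching: a fresh size-b subset at every step.
            batches = [rng.choice(n, size=b, replace=False) for _ in range(n // b)]
        for batch in batches:
            # Per-iteration cost: b * m integrand-gradient evaluations.
            g = np.mean(
                [component_grad(x, i, m, centers, sigma, rng) for i in batch],
                axis=0,
            )
            t += 1
            x = x - g / t  # 1/t step size, chosen for this 1-strongly-convex toy
    return x

centers = np.random.default_rng(1).normal(size=(20, 3))
x_star = centers.mean(axis=0)  # minimizer of the toy objective
x_mb = doubly_sgd(centers, b=4, m=2, epochs=400)
x_rr = doubly_sgd(centers, b=4, m=2, epochs=400, reshuffle=True)
print(np.linalg.norm(x_mb - x_star), np.linalg.norm(x_rr - x_star))
```

Both runs spend the same budget of $b \times m = 8$ gradient evaluations per step; only the way the minibatch indices are drawn differs.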

Paper Structure (71 sections, 18 theorems, 190 equations, 3 figures, 2 tables)

Introduction
Preliminaries
Notation
Stochastic Gradient Descent on Finite-Sums
Finite-Sum Problems.
Doubly Stochastic Gradients
Doubly Stochastic Gradient
Dependent Component Gradient Estimators.
Technical Assumptions on Gradient Estimators
ER Condition.
Why the ER condition?
BV Condition.
Convergence Guarantees for SGD
Sufficiency of ER and BV.
Why focus on strongly convex functions?
...and 56 more sections

Key Result

Proposition 1

Let $\rvvg$ satisfy $\mathrm{ER}\left(\mathcal{L}\right)$. Then the $m$-sample i.i.d. average of $\rvvg$ satisfies $\mathrm{ER}\left(\mathcal{L}/m\right)$.
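
A short sketch of where the $1/m$ factor comes from. The ER definition is not reproduced on this page, so the sketch assumes it takes the variance form $\mathrm{tr}\,\mathbb{V}\left[\rvvg\left(\vx\right) - \rvvg\left(\vx_*\right)\right] \leq 2 \mathcal{L} \left(f\left(\vx\right) - f\left(\vx_*\right)\right)$, with $\rvvg\left(\vx\right)$ and $\rvvg\left(\vx_*\right)$ sharing the same randomness. The symbols $\rvvg_1, \ldots, \rvvg_m$ (the i.i.d. copies) and $\bar{\rvvg}$ (their average) are introduced here for illustration only.

```latex
\begin{align*}
\mathrm{tr}\,\mathbb{V}\left[\bar{\rvvg}\left(\vx\right) - \bar{\rvvg}\left(\vx_*\right)\right]
&= \mathrm{tr}\,\mathbb{V}\left[\frac{1}{m}\sum_{j=1}^{m}
   \left(\rvvg_j\left(\vx\right) - \rvvg_j\left(\vx_*\right)\right)\right] \\
&= \frac{1}{m^2}\sum_{j=1}^{m}
   \mathrm{tr}\,\mathbb{V}\left[\rvvg_j\left(\vx\right) - \rvvg_j\left(\vx_*\right)\right]
   && \text{(independence across $j$)} \\
&= \frac{1}{m}\,
   \mathrm{tr}\,\mathbb{V}\left[\rvvg\left(\vx\right) - \rvvg\left(\vx_*\right)\right]
   && \text{(identical distribution)} \\
&\leq \frac{2\mathcal{L}}{m}\left(f\left(\vx\right) - f\left(\vx_*\right)\right).
\end{align*}
```

Under the assumed form, the last line is exactly $\mathrm{ER}\left(\mathcal{L}/m\right)$ for the averaged estimator.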

Figures (3)

Figure 1: Trade-off between $b$ and $m$ on the gradient variance $\mathrm{tr}\mathbb{V}\rvvg\left(\vx_*\right)$ under varying budgets $m \times b$. The problem is a finite sum of $d = 10$, $n=1024$ isotropic quadratics with smoothness constants sampled as $L_i \sim \text{Inv-Gamma}(1/2, 1/2)$ and stationary points sampled as $\vx^*_i \sim \mathcal{N}\left(\mathbf{0}_d, s^2 \mathbf{I}_d\right)$, where the gradient has additive noise of $\rvveta \sim \mathcal{N}\left(\mathbf{0}_d, \mathbf{I}_d\right)$. Larger $s$ means more heterogeneous data.
Figure 2: Implications between general gradient variance conditions for some unbiased estimator $\rvvg\left(\vx\right) = \nabla f \left(\vx; \rvveta\right)$ of $\nabla f\left(\vx\right) = \mathbb{E} \rvvg\left(\vx\right)$. The dashed arrows hold if $f$ is further assumed to be QFG; the dotted arrow holds if the integrand $f(\vx; \veta)$ is uniformly convex such that it is convex with respect to $\vx$ for any fixed $\veta$. (1), (5), (9), (13) are established by gower_sgd_2021; (2) is proven in \ref{thm:sgisqv}; (3) is proven in \ref{thm:qviswg}; (4) is proven in \ref{thm:qvisqes}; (7) is proven in \ref{thm:usqes}; (8) is proven in \ref{thm:usces}; (6) is proven by nguyen_sgd_2018 but we restate the proof in \ref{thm:qesises}; (11) is proven in \ref{thm:cescer}; (10), (12) hold trivially if $\vx_* \in \mathop{\mathrm{arg\,min}}\limits_{\vx \in \mathcal{X}} f\left(\vx\right)$ are all stationary points.
Figure 3: Implications of assumptions on the components $f_1, \ldots, f_n$ to the minibatch subsampling gradient estimator $\nabla f_{\rvB}$ of $F = \frac{1}{n}\left(f_1 + \ldots + f_n\right)$. (1), (4) are established by gower_sgd_2021, while (3) trivially follows from the fact that $\vx_*$-convexity is strictly weaker than (global) convexity, and (2) was established by gower_sgd_2019.
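
The setup described in the Figure 1 caption can be approximated with a small closed-form calculation. The sketch below is not the authors' code and rests on several assumptions: the components are $f_i(\vx) = \frac{L_i}{2}\lVert \vx - \vx^*_i \rVert^2$, Inv-Gamma$(1/2, 1/2)$ is read as (shape, scale), minibatches are drawn without replacement, and the additive noise is independent across components and Monte Carlo samples. The function names `make_component_grads` and `grad_variance` are hypothetical.

```python
import numpy as np

def make_component_grads(n=1024, d=10, s=1.0, seed=0):
    """Return grad f_i(x_*) for f_i(x) = (L_i / 2) * ||x - x_i*||^2."""
    rng = np.random.default_rng(seed)
    # Assumption: Inv-Gamma(1/2, 1/2) is (shape, scale), i.e. the reciprocal of
    # a Gamma(shape=1/2, rate=1/2) draw, which is NumPy scale = 2.
    L = 1.0 / rng.gamma(shape=0.5, scale=2.0, size=n)
    x_i = rng.normal(0.0, s, size=(n, d))
    # Minimizer of the finite sum: L-weighted average of the stationary points.
    x_star = (L[:, None] * x_i).sum(axis=0) / L.sum()
    return L[:, None] * (x_star - x_i)

def grad_variance(grads, b, m):
    """tr V of the doubly stochastic gradient at x_* under the stated assumptions."""
    n, d = grads.shape
    centered = grads - grads.mean(axis=0)
    tau2 = np.mean(np.sum(centered**2, axis=1))  # population variance of grad f_i(x_*)
    # Sampling b of n components without replacement.
    subsampling = (n - b) / (b * (n - 1)) * tau2
    # Independent N(0, I_d) noise averaged over m samples and b components.
    component = d / (m * b)
    return subsampling + component

grads = make_component_grads(s=1.0)
budget = 64
table = {(b, budget // b): grad_variance(grads, b, budget // b) for b in (1, 4, 16, 64)}
for (b, m), v in table.items():
    print(f"b={b:3d}  m={m:3d}  tr V = {v:.4g}")
```

With independent noise, the component term depends only on the product $m \times b$, so under a fixed budget any change in variance comes from the subsampling term, which shrinks as $b$ grows. The sketch does not model correlated component estimators.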

Theorems & Definitions (53)

Definition 1: Expected Residual; ER
Proposition 1
Definition 2: Bounded Gradient Variance
Remark 1
Remark 2
Remark 3
proof
Corollary 1
Remark 4: For dependent estimators, increasing $b$ also reduces component variance.
Remark 5
...and 43 more
