Demystifying SGD with Doubly Stochastic Gradients
Kyurae Kim, Joohwan Ko, Yi-An Ma, Jacob R. Gardner
TL;DR
This work analyzes SGD on finite-sum objectives in which each component $f_i$ is itself an intractable expectation, so the gradient is estimated twice over: each component gradient by Monte Carlo, and the sum by subsampling components (doubly SGD). The authors derive a general variance bound that separates the contributions of component estimator variance, cross-component correlations, and subsampling noise, and show that doubly SGD converges under the expected residual (ER) and bounded variance (BV) conditions even when the component estimators are dependent. They also extend the results to random reshuffling (RR), which improves the iteration complexity in the strongly convex setting. A key practical insight is that, under a fixed per-iteration budget of $b \times m$, increasing the minibatch size $b$ often reduces variance more effectively than increasing the number of Monte Carlo samples $m$, particularly when estimators are correlated, and RR can yield substantial gains in convergence speed. Simulations corroborate the theory, and the results offer guidance for allocating computation between minibatch size and Monte Carlo samples in "finite sum with infinite data" problems such as diffusion models and variational inference.
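The budget trade-off described above can be checked numerically on a toy problem. The sketch below is not the authors' code: it assumes hypothetical components $f_i(x) = \mathbb{E}_{\eta}[\tfrac{1}{2}(x - \mu_i - \sigma\eta)^2]$ with standard normal $\eta$ drawn independently across components, and the names `doubly_stochastic_grad` and `empirical_variance` are made up for illustration.

```python
import numpy as np


def doubly_stochastic_grad(x, mu, sigma, b, m, rng):
    """One doubly stochastic gradient estimate at the scalar point x."""
    n = mu.shape[0]
    # Subsample b of the n components without replacement.
    idx = rng.choice(n, size=b, replace=False)
    # Draw m Monte Carlo samples per selected component, independently
    # across components (an assumption of this toy setup).
    eta = rng.standard_normal((b, m))
    # Gradient of 0.5 * (x - mu_i - sigma * eta)^2 with respect to x.
    grads = x - mu[idx][:, None] - sigma * eta
    # Average over both the minibatch and the Monte Carlo samples.
    return grads.mean()


def empirical_variance(b, m, trials=5000, seed=0):
    """Estimate the variance of the doubly stochastic gradient by simulation."""
    rng = np.random.default_rng(seed)
    mu = np.linspace(-1.0, 1.0, 16)  # heterogeneous component means (toy data)
    samples = [
        doubly_stochastic_grad(0.0, mu, 1.0, b, m, rng) for _ in range(trials)
    ]
    return float(np.var(samples))


if __name__ == "__main__":
    # Same budget b * m = 8, split two different ways.
    print("b=8, m=1:", empirical_variance(b=8, m=1))
    print("b=1, m=8:", empirical_variance(b=1, m=8))
```

In this toy setup the Monte Carlo part of the variance is $\sigma^2/(bm)$, which is the same for both splits, while the subsampling part shrinks only as $b$ grows, so the larger minibatch should give the smaller variance. With dependent estimators the comparison can change, which is the case the variance bound is meant to cover.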
Abstract
Optimization objectives in the form of a sum of intractable expectations, a setting also known as "finite sum with infinite data," are rising in importance (e.g., diffusion models, variational autoencoders, and many more). For these problems, a popular strategy is to employ SGD with doubly stochastic gradients (doubly SGD): the expectations are estimated using the gradient estimator of each component, while the sum is estimated by subsampling over these estimators. Despite its popularity, little is known about the convergence properties of doubly SGD, except under strong assumptions such as bounded variance. In this work, we establish the convergence of doubly SGD with independent minibatching and random reshuffling under general conditions, which encompass dependent component gradient estimators. In particular, for dependent estimators, our analysis allows a fine-grained analysis of the effect of correlations. As a result, under a per-iteration computational budget of $b \times m$, where $b$ is the minibatch size and $m$ is the number of Monte Carlo samples, our analysis suggests where one should invest most of the budget in general. Furthermore, we prove that random reshuffling (RR) improves the complexity dependence on the subsampling noise.
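As a concrete illustration of the procedure described in the abstract, the following is a minimal sketch of doubly SGD with random reshuffling. It is not the authors' implementation: the quadratic components $f_i(x) = \mathbb{E}_{\eta}[\tfrac{1}{2}(x - \mu_i - \sigma\eta)^2]$, the constant step size, and the function name `doubly_sgd_rr` are assumptions chosen so that the minimizer (the mean of the $\mu_i$) is known in closed form.

```python
import numpy as np


def doubly_sgd_rr(mu, sigma, b, m, step, epochs, seed=0):
    """Doubly SGD with random reshuffling on a toy quadratic objective."""
    rng = np.random.default_rng(seed)
    n = mu.shape[0]
    x = 0.0
    for _ in range(epochs):
        # Random reshuffling: draw one permutation per epoch and sweep it
        # in consecutive minibatches, so every component is used once.
        perm = rng.permutation(n)
        for start in range(0, n, b):
            batch = perm[start:start + b]
            # m Monte Carlo samples for each component in the minibatch.
            eta = rng.standard_normal((batch.shape[0], m))
            # Doubly stochastic gradient: average over batch and samples.
            grad = np.mean(x - mu[batch][:, None] - sigma * eta)
            x -= step * grad
    return float(x)


if __name__ == "__main__":
    mu = np.linspace(1.0, 3.0, 16)  # toy component means; minimizer is 2.0
    x_hat = doubly_sgd_rr(mu, sigma=1.0, b=4, m=4, step=0.05, epochs=200)
    print("estimate:", x_hat, "target:", mu.mean())
```

With a constant step size the iterate is only expected to settle in a neighborhood of the minimizer whose size depends on the gradient noise. Replacing `rng.permutation(n)` with a fresh `rng.choice(n, size=b, replace=False)` at every step would give the independent minibatching variant instead.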
