Non-asymptotic Analysis of Biased Adaptive Stochastic Approximation

Sobihan Surendran, Antoine Godichon-Baggioni, Adeline Fermanian, Sylvain Le Corff

TL;DR

This work develops a non-asymptotic theory for biased adaptive stochastic approximation in non-convex optimization. By formalizing a Biased Adaptive Stochastic Approximation (BASA) framework that accommodates adaptive updates (e.g., Adagrad, RMSProp, AMSGRAD) and time-varying gradient bias, the authors prove convergence to critical points and, under the Polyak-Łojasiewicz (PL) condition, linear rates. Without PL, they derive explicit rates such as $O(\log n/\sqrt{n} + b_n)$, where $b_n$ is a bias-dependent term, and provide bias-dependent refinements for practical algorithms; these results apply to biased estimators in bilevel/conditional optimization and IWAE-type models. Experiments on variational autoencoders (VAE/IWAE/BR-IWAE) illustrate the effect of bias and confirm the theoretical predictions, guiding hyperparameter choices that balance convergence speed and computational cost.

Abstract

Stochastic Gradient Descent (SGD) with adaptive steps is widely used to train deep neural networks and generative models. Most theoretical results assume that it is possible to obtain unbiased gradient estimators, which is not the case in several recent deep learning and reinforcement learning applications that use Monte Carlo methods. This paper provides a comprehensive non-asymptotic analysis of SGD with biased gradients and adaptive steps for non-convex smooth functions. Our study incorporates time-dependent bias and emphasizes the importance of controlling the bias of the gradient estimator. In particular, we establish that Adagrad, RMSProp, and AMSGRAD, an exponential moving average variant of Adam, with biased gradients, converge to critical points for smooth non-convex functions at a rate similar to existing results in the literature for the unbiased case. Finally, we provide experimental results using Variational Autoencoders (VAE) and applications to several learning frameworks that illustrate our convergence results and show how the effect of bias can be reduced by appropriate hyperparameter tuning.
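
The role of a time-dependent bias can be illustrated with a minimal sketch, which is not the authors' experimental setup. It assumes a toy objective $V(\theta) = \|\theta\|^2/2$, the textbook diagonal Adagrad update, Gaussian gradient noise, and a deterministic bias of norm $n^{-r}$ added to every gradient estimate. The names `biased_adagrad` and `quad_grad`, and all parameter names and default values, are hypothetical choices made for this illustration.

```python
import numpy as np

def quad_grad(theta):
    """Exact gradient of the toy objective V(theta) = 0.5 * ||theta||^2 (assumed)."""
    return theta.copy()

def biased_adagrad(grad, theta0, n_steps, r, c_gamma=0.5, eps=1e-8,
                   noise=0.1, seed=0):
    """Run diagonal Adagrad with a noisy gradient whose bias has norm n**(-r).

    Returns the final iterate and the squared norm of the *true* gradient
    at every step, so the effect of the bias can be inspected.
    """
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float).copy()
    accum = np.zeros_like(theta)                    # running sum of squared estimates
    d = theta.size
    bias_dir = np.ones(d) / np.sqrt(d)              # fixed unit direction (assumption)
    grad_norms = []
    for n in range(1, n_steps + 1):
        g_true = grad(theta)
        grad_norms.append(float(np.sum(g_true ** 2)))
        # Biased, noisy estimate: true gradient + bias of norm n^{-r} + noise.
        g = g_true + n ** (-r) * bias_dir + noise * rng.standard_normal(d)
        accum += g ** 2
        theta -= c_gamma * g / (np.sqrt(accum) + eps)  # coordinate-wise Adagrad step
    return theta, np.array(grad_norms)

# Compare a constant bias (r = 0) with a decaying bias (r = 1).
theta_start = np.full(5, 2.0)
_, norms_const = biased_adagrad(quad_grad, theta_start, 2000, r=0.0)
_, norms_decay = biased_adagrad(quad_grad, theta_start, 2000, r=1.0)
print("mean squared gradient norm, last 100 steps")
print("  constant bias:", norms_const[-100:].mean())
print("  decaying bias:", norms_decay[-100:].mean())
```

Under these assumed settings, a constant bias should leave the iterates near a shifted point where the true gradient does not vanish, whereas a decaying bias lets the squared gradient norm keep shrinking. This is only a qualitative picture of why controlling the bias matters, not a reproduction of the rates in the paper.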

Paper Structure (42 sections, 17 theorems, 153 equations, 11 figures, 3 tables, 4 algorithms)

Key Result

Theorem 4.1

Assume that Assumptions A1 to A4 hold. Let $\theta_{n} \in \mathbb{R}^{d}$ be the $n$-th iterate of the recursion ASA and $\gamma_{n} = C_{\gamma}n^{-\gamma}, \beta_{n} = C_{\beta}n^{\beta}, \lambda_{n} = C_{\lambda}n^{-\lambda}$ with $C_{\gamma}>0, C_{\beta}>0$, and $C_{\lambda}>0$. Assume that $\gamma,

Figures (11)

  • Figure 1: Negative Log-Likelihood on the test set for Different Generative Models with Adagrad, RMSProp, and Adam on CIFAR-10. Bold lines represent the mean over 5 independent runs.
  • Figure 2: Value of $\| \nabla V(\theta_n) \|^{2}$ in IWAE with Adagrad (on the left), RMSProp, and Adam (on the right). Bold lines represent the mean over 5 independent runs. Figures are plotted on a logarithmic scale for better visualization. Both figures have the same scale, so we have not shown the dashed theoretical curves on the right for better clarity.
  • Figure 3: Value of $V(\theta_n) - V(\theta^*)$ (on the left) and $\| \nabla V(\theta_n) \|^{2}$ (on the right) with Adagrad for different values of $r_n=n^{-r}$ and a learning rate $\gamma_n=n^{-1/2}$. The dashed curve corresponds to the expected convergence rate $\mathcal{O}(n^{-1/4})$ for $r = 1/4$ and $\mathcal{O}(n^{-1/2})$ for $r \geq 1/2$.
  • Figure 4: IWAE on the FashionMNIST Dataset with Adagrad for different values of $\alpha$. Bold lines represent the mean over 5 independent runs.
  • Figure 5: IWAE on the FashionMNIST Dataset with RMSProp for different values of $\alpha$. Bold lines represent the mean over 5 independent runs.
  • ...and 6 more figures

Theorems & Definitions (24)

  • Theorem 4.1
  • Theorem 4.2
  • Corollary 4.3
  • Corollary 4.4
  • Corollary 4.5
  • Theorem 4.6
  • Lemma A.1
  • Theorem A.2
  • proof
  • Lemma A.3
  • ...and 14 more