Accelerated stochastic approximation with state-dependent noise

Sasila Ilandarideva, Anatoli Juditsky, Guanghui Lan, Tianjiao Li

TL;DR

This work develops accelerated stochastic approximation methods for convex optimization with state-dependent gradient noise, a setting motivated by generalized linear regression and sparse recovery. It presents two non-Euclidean algorithms, SAGD and SGE, and proves that SGE, in particular, achieves optimal iteration and sample complexities under broader noise assumptions than SAGD, including heavy-tailed or discontinuous gradient observations. A multi-stage restart framework for SGE is proposed to handle problems with quadratic growth, and a sparse-recovery variant (SGE-SR) is developed that combines hard-thresholding with accelerated updates to attain favorable dependence on sparsity. Theoretical results are complemented by numerical experiments demonstrating the efficacy of the proposed methods in high-dimensional, structured settings. The work advances the understanding of stochastic optimization with state-dependent noise and offers practical algorithms with strong convergence guarantees for statistical estimation and sparse recovery problems.
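The hard-thresholding step mentioned above can be illustrated with a minimal sketch. This is the generic operator that keeps the `s` largest-magnitude coordinates of a vector; the function name `hard_threshold` and its list-based signature are assumptions made for illustration, and the sketch does not reproduce how SGE-SR interleaves thresholding with its accelerated updates.

```python
def hard_threshold(x, s):
    """Keep the s largest-magnitude entries of x and set the rest to zero.

    Illustrative helper (hypothetical name), not the paper's SGE-SR routine.
    """
    if s <= 0:
        return [0.0] * len(x)
    # Indices of the s entries with the largest absolute value.
    order = sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)
    keep = set(order[:s])
    return [xi if i in keep else 0.0 for i, xi in enumerate(x)]


if __name__ == "__main__":
    print(hard_threshold([0.1, -3.0, 0.5, 2.0], 2))
```

The output of this operator is always `s`-sparse (or unchanged when `s` is at least the dimension), which is the property a sparse-recovery scheme relies on between stages.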

Abstract

We consider a class of stochastic smooth convex optimization problems under rather general assumptions on the noise in the stochastic gradient observation. As opposed to the classical problem setting in which the variance of noise is assumed to be uniformly bounded, herein we assume that the variance of stochastic gradients is related to the "sub-optimality" of the approximate solutions delivered by the algorithm. Such problems naturally arise in a variety of applications, in particular, in the well-known generalized linear regression problem in statistics. However, to the best of our knowledge, none of the existing stochastic approximation algorithms for solving this class of problems attain optimality in terms of the dependence on accuracy, problem parameters, and mini-batch size. We discuss two non-Euclidean accelerated stochastic approximation routines--stochastic accelerated gradient descent (SAGD) and stochastic gradient extrapolation (SGE)--which carry a particular duality relationship. We show that both SAGD and SGE, under appropriate conditions, achieve the optimal convergence rate, attaining the optimal iteration and sample complexities simultaneously. However, corresponding assumptions for the SGE algorithm are more general; they allow, for instance, for efficient application of the SGE to statistical estimation problems under heavy tail noises and discontinuous score functions. We also discuss the application of the SGE to problems satisfying quadratic growth conditions, and show how it can be used to recover sparse solutions. Finally, we report on some simulation experiments to illustrate numerical performance of our proposed algorithms in high-dimensional settings.
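To make the notion of state-dependent noise concrete, here is a small sketch for the simplest instance of the regression setting: least squares with standard Gaussian regressors. The model ($y = \phi^\top x^* + \sigma\varepsilon$, with $\phi \sim N(0, I_n)$ and $\varepsilon \sim N(0,1)$), the function names, and all parameter values are assumptions chosen for illustration and are not taken from the paper. In this model the oracle $G(x,\xi) = \phi(\phi^\top x - y)$ has variance $(n+1)\|x - x^*\|_2^2 + n\sigma^2 = 2(n+1)[f(x) - f(x^*)] + n\sigma^2$, so the noise level grows with the sub-optimality of $x$ instead of being uniformly bounded.

```python
import random


def oracle_variance(x, x_star, sigma, n_samples=20000, seed=0):
    """Monte Carlo estimate of E||G(x, xi) - g(x)||_2^2 for least squares.

    Assumed model (illustrative): phi ~ N(0, I_n), y = phi^T x_star + sigma * eps.
    """
    rng = random.Random(seed)
    n = len(x)
    # d = x - x_star is also the exact gradient g(x) of f(x) = 0.5 * E(phi^T x - y)^2.
    d = [xi - si for xi, si in zip(x, x_star)]
    total = 0.0
    for _ in range(n_samples):
        phi = [rng.gauss(0.0, 1.0) for _ in range(n)]
        eps = rng.gauss(0.0, 1.0)
        # Residual phi^T x - y.
        r = sum(p * di for p, di in zip(phi, d)) - sigma * eps
        # Squared deviation of the stochastic gradient phi * r from g(x) = d.
        total += sum((p * r - di) ** 2 for p, di in zip(phi, d))
    return total / n_samples


def closed_form_variance(x, x_star, sigma):
    """Exact variance in the assumed Gaussian model: 2(n+1)[f(x)-f(x*)] + n*sigma^2."""
    n = len(x)
    gap = 0.5 * sum((xi - si) ** 2 for xi, si in zip(x, x_star))  # f(x) - f(x_star)
    return 2 * (n + 1) * gap + n * sigma ** 2


if __name__ == "__main__":
    x_star = [0.0, 0.0, 0.0, 0.0]
    for x in ([0.1, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]):
        print(x, oracle_variance(x, x_star, 0.1), closed_form_variance(x, x_star, 0.1))
```

Near the optimum the variance is dominated by the $n\sigma^2$ term, while far from it the sub-optimality term takes over; a uniform variance bound would have to cover the worst case over the whole domain.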

Paper Structure (46 sections, 12 theorems, 147 equations, 6 figures, 4 algorithms)

Key Result

Lemma

Denote (we use the shorthand notation $V_{x}^*(y)$ when $x_0$ is clear from context). The mini-batch estimator $G_t$ satisfies, for any $x_0, x, u \in X$ and $\gamma \in \mathbb{R}$, Consequently, when Assumption (assump:variance) holds,

Figures (6)

  • Figure 1: Left plot: variance of the stochastic oracle $\mathcal{G}_1(x,\xi)$ as a function of $x$ (solid line) and upper bound $4\sigma^2(0)+3[f(x)-f(0)]$ (dashed line); right plot: variance of $\mathcal{G}_2(x,\xi)$ as a function of $x$ (solid line) and upper bound $2.3\sigma^2(0)+3[f(x)-f(0)]$ (dashed line).
  • Figure 2: Activation functions
  • Figure 3: Estimation error $\|x_t -x^*\|_2$ against the number of stochastic oracle calls for SGE-SR and SMD-SR algorithms. In the left, middle, and right columns of the plot we show results for the linear activation $u_{1}$, and nonlinear $u_{1/2}$ and $u_{1/10}$, respectively. Two figure rows correspond to two different noise levels, $\sigma=0.1$ (the upper row) and $\sigma=0.001$ (the bottom row). The legend specifies the value $m_0$ of the batch size of the preliminary phase of the algorithm for both routines.
  • Figure 4: Estimation error $\|x_t -x^*\|_2$ against the number of stochastic oracle calls for SGE-SR in Gaussian (light-tail) and Student $t_5$ (heavy-tail) regressor and noise generation setups. In the left, middle, and right columns of the plot we show results for the linear activation $u_{1}$, and nonlinear $u_{1/2}$ and $u_{1/10}$, respectively. Two figure rows correspond to two different noise levels, $\sigma=0.1$ (the upper row) and $\sigma=0.001$ (the bottom row). The legend specifies the value $m_0$ of the batch size of the preliminary phase of the algorithm for both routines.
  • Figure 5: Error $\|x_t -x^*\|_2$ of the SGE-SR algorithm for three values of the problem condition number. First row: algorithm error against the number of oracle calls; second row: error against the number of algorithm iterations. Figure columns correspond to the results for $u_1$, $u_{1/2}$, and $u_{1/10}$ activation functions and $\sigma=0.001$.
  • ...and 1 more figures

Theorems & Definitions (12)

  • lemma
  • theorem 1
  • corollary
  • lemma
  • theorem 2
  • corollary
  • theorem 3
  • corollary
  • corollary
  • corollary
  • ...and 2 more