Exponential Concentration in Stochastic Approximation

Kody Law, Neil Walton, Shangda Yang

TL;DR

The paper proves that stochastic approximation algorithms, including Projected Stochastic Gradient Descent (PSGD), Kiefer-Wolfowitz, and Frank-Wolfe, concentrate exponentially near an optimum. This yields faster convergence rates, notably $O(1/t)$ and linear rates.

Abstract

We analyze the behavior of stochastic approximation algorithms where iterates, in expectation, progress towards an objective at each step. When progress is proportional to the step size of the algorithm, we prove exponential concentration bounds. These tail bounds contrast with asymptotic normality results, which are more frequently associated with stochastic approximation. The methods that we develop rely on a geometric ergodicity proof. This extends a result on Markov chains due to Hajek (1982) to the area of stochastic approximation algorithms. We apply our results to several different stochastic approximation algorithms, specifically Projected Stochastic Gradient Descent, Kiefer-Wolfowitz and Stochastic Frank-Wolfe algorithms. When applicable, our results prove faster $O(1/t)$ and linear convergence rates for Projected Stochastic Gradient Descent with a non-vanishing gradient.

Paper Structure (42 sections, 29 theorems, 178 equations, 11 figures)

Key Result

Theorem 1

For learning rates of the form $\alpha_t = a/(u+t)^\gamma$ with $a,u>0$ and $\gamma \in [0,1]$, if Conditions fcond:1 and fcond:2 are satisfied by a stochastic approximation algorithm, then [...] and [...] for time-independent constants $I$, $J$ and $K$.
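
The step-size family named in Theorem 1 can be written down directly. The following is a minimal sketch for illustration only: the function name `step_size` and its default arguments are assumptions and do not come from the paper.

```python
def step_size(t: int, a: float = 1.0, u: float = 1.0, gamma: float = 1.0) -> float:
    """Return alpha_t = a / (u + t)**gamma.

    The name and the defaults are hypothetical; only the functional form
    and the parameter ranges (a, u > 0 and gamma in [0, 1]) follow the
    theorem statement.
    """
    if a <= 0 or u <= 0:
        raise ValueError("a and u must be positive")
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    return a / (u + t) ** gamma
```

With `gamma=0.0` the schedule reduces to the constant step size `a`. With `a=1`, `u=1` and `gamma=1` it gives $\alpha_t = 1/(1+t)$, the schedule quoted in the figure captions.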

Figures (11)

  • Figure 1: Simulation of a stochastic gradient descent algorithm with a constant step size on the function $f(x) = (x+1)^2$. Figure \ref{fig:sub1}: When the objective is unconstrained, the density of the iterates is well approximated by a normal distribution with variance $\sigma^2 = O(\alpha)$, where $\alpha$ is the step size of the algorithm. The distance to the optimum is $O(\alpha^{1/2})$. Figure \ref{fig:sub2}: When $x$ is constrained to the positive orthant, the gradient no longer vanishes at the optimum. The distribution of iterates away from zero now decays exponentially with rate $\lambda = O(\alpha^{-1})$. So, for step size $\alpha$, the distance to the optimum is $O(\alpha)$. This paper proves that exponential concentration holds more generally for stochastic approximation procedures with non-vanishing gradients.
  • Figure 2: Under the gradient condition \ref{cond:D1}, the objective need not be convex nor continuously differentiable. We require the derivative in the direction of the optimum to be non-zero. Under convexity and sharpness \ref{cond:Sharp}, the envelope of the function is bounded below by a cone. Here condition \ref{cond:D1} is satisfied.
  • Figure 3: Convergence of Frank-Wolfe, PSGD, and Kiefer-Wolfowitz algorithms on the circle constraint example. Figure \ref{fig:CircleConstraints}: The black dot is the optimal solution $(7, 7)$. Figure \ref{fig:CircleEp}: The expectation is computed over 20 realizations. The stochastic gradients for Frank-Wolfe and PSGD are computed with batch size $B=10$. The parameter $v=0.8$ is chosen for Kiefer-Wolfowitz. The step-size parameters are chosen as $a=0.9, u=1$ and $\gamma=1$ such that $\alpha_t = 1/(1+t)$. The fitted slopes are $-1.00$, $-1.00$ and $-1.10$ for Frank-Wolfe, PSGD and Kiefer-Wolfowitz, respectively.
  • Figure 4: Convergence of PSGD and Kiefer-Wolfowitz algorithms on the three spherical constraints problem. Figure \ref{fig:ThreeBalls}: The black dot is the optimal solution $(0, 0, \sqrt{3})$. Figure \ref{fig:ThreeBallsAlg}: The expectation is computed over 20 realizations. The stochastic gradients for PSGD are computed with batch size $B=10$. The parameter $v = 1$ is chosen for Kiefer-Wolfowitz. The step-size parameters are chosen as $a=1, u=1$ and $\gamma=1$ such that $\alpha_t = 1/(1+t)$. The fitted slopes are $-1.01$ and $-1.00$ for PSGD and Kiefer-Wolfowitz, respectively.
  • Figure 5: Linear convergence of PSGD and Kiefer-Wolfowitz for the three spherical constraints problem. The expectation is computed over 20 realizations. The stochastic gradients are computed with $B=10$. The parameter $v=10$ is chosen for Kiefer-Wolfowitz. The simulations are conducted with learning rates divided by 2 every 20 steps.
  • ...and 6 more figures
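
The contrast described in the Figure 1 caption can be reproduced with a short simulation. The sketch below runs constant-step stochastic gradient descent on $f(x) = (x+1)^2$, with and without projection onto $x \ge 0$, and reports the average distance of the iterates from the optimum. The Gaussian gradient noise, its standard deviation, the run length, the burn-in and the name `mean_distance` are all assumptions made for illustration; the caption does not specify the noise model or these settings.

```python
import random
import statistics


def grad_f(x: float) -> float:
    """Gradient of f(x) = (x + 1)**2."""
    return 2.0 * (x + 1.0)


def mean_distance(alpha: float, constrained: bool, steps: int = 50_000,
                  burn_in: int = 5_000, noise_std: float = 2.0,
                  seed: int = 0) -> float:
    """Average |x_t - x*| after burn-in for constant-step SGD.

    Assumptions (not from the paper): additive Gaussian gradient noise
    with standard deviation noise_std, and the run-length defaults.
    """
    rng = random.Random(seed)
    # Unconstrained optimum is x* = -1; on x >= 0 the optimum is x* = 0,
    # where the gradient equals 2 and therefore does not vanish.
    x_star = 0.0 if constrained else -1.0
    x = x_star
    distances = []
    for t in range(steps):
        noisy_grad = grad_f(x) + rng.gauss(0.0, noise_std)
        x -= alpha * noisy_grad
        if constrained:
            x = max(0.0, x)  # Euclidean projection onto x >= 0
        if t >= burn_in:
            distances.append(abs(x - x_star))
    return statistics.fmean(distances)


if __name__ == "__main__":
    for alpha in (0.01, 0.04):
        print(alpha,
              mean_distance(alpha, constrained=False),
              mean_distance(alpha, constrained=True))
```

Under these assumptions, quadrupling the step size should roughly double the unconstrained distance and roughly quadruple the constrained one, in line with the $O(\alpha^{1/2})$ and $O(\alpha)$ scalings stated in the caption.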

Theorems & Definitions (51)

  • Theorem 1
  • Theorem 2
  • Remark 1: Convexity and Sharpness.
  • Lemma 1
  • Remark 2: Smooth Functional Constraints
  • Lemma 2
  • Remark 3: Projection
  • Theorem 3
  • Theorem 4
  • Theorem 5
  • ...and 41 more