The case for and against fixed step-size: Stochastic approximation algorithms in optimization and machine learning

Caio Kalil Lauand, Ioannis Kontoyiannis, Sean Meyn

TL;DR

This paper analyzes stochastic approximation with a fixed step-size in a Markov-noise setting, addressing the root-finding problem $\bar{f}(θ^*)=0$, where $\bar{f}(θ)=\mathbb{E}[f(θ,Φ)]$. It establishes geometric ergodicity for the joint process, derives moment bounds and a bias-variance trade-off for constant gain, and characterizes the performance of Polyak–Ruppert averaging, including conditions under which the asymptotic covariance is minimized. The work also provides a refined theory for linear SA, showing exponential convergence and explicit bias/variance decompositions; it further presents a theory-without-convergence extension for additive-noise models and CLTs for empirical targets. Through SGD and TD-learning examples, the paper illustrates the practical implications: constant gain can expedite global search in optimization with multiple local minima, while vanishing gain better controls bias in reinforcement learning. It closes by outlining avenues for bias control and for understanding memory effects in stochastic optimization.

Abstract

Theory and application of stochastic approximation (SA) have become increasingly relevant, due in part to applications in optimization and reinforcement learning. This paper takes a new look at SA with constant step-size $α>0$, defined by the recursion $$θ_{n+1} = θ_{n} + α f(θ_n,Φ_{n+1})$$ in which $θ_n\in\mathbb{R}^d$ and $\{Φ_{n}\}$ is a Markov chain. The goal is to approximately solve the root-finding problem $\bar{f}(θ^*)=0$, where $\bar{f}(θ)=\mathbb{E}[f(θ,Φ)]$ and $Φ$ has the steady-state distribution of $\{Φ_{n}\}$. The following conclusions are obtained under an ergodicity assumption on the Markov chain, compatible assumptions on $f$, and for $α>0$ sufficiently small: $\textbf{1.}$ The pair process $\{(θ_n,Φ_n)\}$ is geometrically ergodic in a topological sense. $\textbf{2.}$ For every $1\le p\le 4$, there is a constant $b_p$ such that $\limsup_{n\to\infty}\mathbb{E}[\|θ_n-θ^*\|^p]\le b_p α^{p/2}$ for each initial condition. $\textbf{3.}$ The Polyak-Ruppert-style averaged estimates $θ^{\text{PR}}_n=n^{-1}\sum_{k=1}^{n}θ_k$ converge to a limit $θ^{\text{PR}}_\infty$ almost surely and in mean square, which satisfies $θ^{\text{PR}}_\infty=θ^*+α\bar{Υ}^*+O(α^2)$ for an identified non-random $\bar{Υ}^*\in\mathbb{R}^d$. Moreover, the covariance is approximately optimal: the limiting covariance matrix of $θ^{\text{PR}}_n$ is approximately minimal in a matricial sense. The two main take-aways for practitioners are application-dependent. It is argued that, in applications to optimization, constant gain algorithms may be preferable even when the objective has multiple local minima, while a vanishing gain algorithm is preferable in applications to reinforcement learning due to the presence of bias.
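
The recursion and the Polyak-Ruppert average in the abstract can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: a scalar toy model $f(θ,Φ) = -(θ-1) + Φ(aθ+b)$ with $θ^*=1$, a symmetric two-state Markov chain $Φ_n\in\{-1,+1\}$, the coefficient values, and a burn-in period that is discarded before averaging.

```python
import random

THETA_STAR = 1.0  # root of the mean field in this toy model (assumption)


def run_constant_gain_sa(alpha, n_steps=200_000, burn_in=10_000, seed=0):
    """Run theta_{n+1} = theta_n + alpha * f(theta_n, Phi_{n+1}) and return
    the average of the iterates collected after a burn-in period.

    Hypothetical toy model, not from the paper:
      f(theta, Phi) = -(theta - 1) + Phi * (a * theta + b),
      Phi is a symmetric two-state Markov chain on {-1, +1}.
    Since Phi has steady-state mean zero, f_bar(theta) = -(theta - 1).
    """
    rng = random.Random(seed)
    stay_prob = 0.9   # assumed probability that the chain keeps its state
    a, b = 0.5, 1.0   # assumed multiplicative and additive noise coefficients
    theta, phi = 0.0, 1
    running_sum = 0.0
    for n in range(burn_in + n_steps):
        # Advance the Markov chain: flip the state with probability 1 - stay_prob.
        if rng.random() > stay_prob:
            phi = -phi
        # Constant step-size SA update driven by the new chain state.
        theta += alpha * (-(theta - THETA_STAR) + phi * (a * theta + b))
        # Accumulate iterates for the averaged estimate (burn-in is a convenience).
        if n >= burn_in:
            running_sum += theta
    return running_sum / n_steps


pr_large = run_constant_gain_sa(alpha=0.05)
pr_small = run_constant_gain_sa(alpha=0.01)
print(f"alpha=0.05: averaged estimate offset = {pr_large - THETA_STAR:+.4f}")
print(f"alpha=0.01: averaged estimate offset = {pr_small - THETA_STAR:+.4f}")
```

In this toy model the offset of the averaged estimate from $θ^*$ should shrink roughly in proportion to $α$, which is the behavior described by conclusion 3. The multiplicative term $aθ$ combined with the memory in the chain is what produces the offset in this sketch; with purely additive noise ($a=0$) the linear recursion would have no steady-state bias.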

Paper Structure

This paper contains 20 sections, 36 theorems, 207 equations, 8 figures.

Key Result

Proposition 2.1

Suppose that $\bar{f}$ is Lipschitz continuous. If (A3${}^\circ$) holds, then there exists a solution to (A3${}^\bullet$). Conversely, if (A3${}^\bullet$) holds, and the scaled vector field $\bar{f}_{\infty}(\theta)$ exists for each $\theta \in \mathbb{R}^d$,

Figures (8)

  • Figure 1: Multiple local minima for the six-hump camel-back function
  • Figure 2: SGD for the camel-back function. Large exploration and small gain: $\sigma^2_W = 400$ and $\alpha=0.02$.
  • Figure 3: SGD for the camel-back function. Moderate exploration and large gain: $\sigma^2_W = 10$ and $\alpha=0.1$.
  • Figure 4: SGD for the modified Styblinski-Tang function using moderate exploration and small gain.
  • Figure 5: Performance of TD($\lambda$) learning with $\lambda=0$. (a) $L_2$ norm of estimation error for $\theta^{\text{PR}}_N$ with the fixed step-size algorithm as a function of $\alpha$. (b) Histogram of estimation error for vanishing and constant step-size algorithms for each dimension of $\theta^{\text{PR}}_N$.
  • ...and 3 more figures

Theorems & Definitions (36)

  • Proposition 2.1
  • Proposition 2.2
  • Lemma 2.3
  • Theorem 2.1
  • Corollary 2.4
  • Lemma 2.5
  • Theorem 2.2
  • Corollary 2.6
  • Theorem 3.1
  • Corollary 3.1
  • ...and 26 more