The case for and against fixed step-size: Stochastic approximation algorithms in optimization and machine learning

Caio Kalil Lauand; Ioannis Kontoyiannis; Sean Meyn

TL;DR

This paper analyzes stochastic approximation (SA) with a fixed step-size in a Markov-noise setting, addressing the root-finding problem $\bar{f}(θ^*)=0$ where $\bar{f}(θ)=\mathbb{E}[f(θ,Φ)]$. It establishes geometric ergodicity of the joint parameter-noise process, derives moment bounds and a bias-variance trade-off for constant gain, and characterizes the performance of Polyak–Ruppert averaging, including conditions under which the asymptotic covariance is approximately minimal. The work also provides a refined theory for linear SA, with exponential convergence and explicit bias/variance decompositions, together with a "theory without convergence" extension for additive-noise models and CLTs for empirical targets. SGD and TD-learning examples illustrate the practical implications: constant gain can expedite global search in optimization with multiple local minima, while vanishing gain better controls bias in reinforcement learning. The paper closes by outlining avenues for bias control and for understanding memory effects in stochastic optimization.

Abstract

Theory and application of stochastic approximation (SA) have become increasingly relevant due in part to applications in optimization and reinforcement learning. This paper takes a new look at SA with constant step-size $α>0$, defined by the recursion $$θ_{n+1} = θ_{n}+ αf(θ_n,Φ_{n+1})$$ in which $θ_n\in\mathbb{R}^d$ and $\{Φ_{n}\}$ is a Markov chain. The goal is to approximately solve the root-finding problem $\bar{f}(θ^*)=0$, where $\bar{f}(θ)=\mathbb{E}[f(θ,Φ)]$ and $Φ$ has the steady-state distribution of $\{Φ_{n}\}$. The following conclusions are obtained under an ergodicity assumption on the Markov chain, compatible assumptions on $f$, and for $α>0$ sufficiently small: $\textbf{1.}$ The pair process $\{(θ_n,Φ_n)\}$ is geometrically ergodic in a topological sense. $\textbf{2.}$ For every $1\le p\le 4$, there is a constant $b_p$ such that $\limsup_{n\to\infty}\mathbb{E}[\|θ_n-θ^*\|^p]\le b_p α^{p/2}$ for each initial condition. $\textbf{3.}$ The Polyak-Ruppert-style averaged estimates $θ^{\text{PR}}_n=n^{-1}\sum_{k=1}^{n}θ_k$ converge almost surely and in mean square to a limit $θ^{\text{PR}}_\infty$, which satisfies $θ^{\text{PR}}_\infty=θ^*+α\barΥ^*+O(α^2)$ for an identified non-random $\barΥ^*\in\mathbb{R}^d$. Moreover, the covariance is approximately optimal: the limiting covariance matrix of $θ^{\text{PR}}_n$ is approximately minimal in a matricial sense. The two main take-aways for practitioners are application-dependent. It is argued that constant gain algorithms may be preferable in applications to optimization, even when the objective has multiple local minima, while a vanishing gain algorithm is preferable in applications to reinforcement learning due to the presence of bias.
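
The recursion and the averaged estimate $θ^{\text{PR}}_n$ described in the abstract can be sketched on a toy scalar problem. Everything specific below is an assumption made for illustration and is not taken from the paper: the two-state Markov chain on $\{-1,+1\}$, the function $f(θ,Φ) = -(1+0.5Φ)(θ-θ^*)+Φ$ (so that $\bar{f}(θ) = -(θ-θ^*)$ under the uniform steady-state distribution), and the names `run_sa`, `stay_prob`, and `theta_star`. Because the noise is both multiplicative and correlated in time, the averaged estimate in this toy model is expected to settle at an offset from $θ^*$ that shrinks as $α$ decreases, which is the kind of bias the abstract refers to.

```python
import numpy as np


def run_sa(alpha, n_steps, theta_star=1.0, stay_prob=0.9, seed=0):
    """Fixed step-size SA: theta_{n+1} = theta_n + alpha * f(theta_n, Phi_{n+1}).

    Phi is a hypothetical two-state Markov chain on {-1, +1} that keeps its
    current state with probability `stay_prob`; its steady state is uniform.
    Returns the final iterate and the Polyak-Ruppert average of all iterates.
    """
    rng = np.random.default_rng(seed)
    # Pre-draw the switching decisions of the chain (True means "flip state").
    flips = (rng.random(n_steps) > stay_prob).tolist()
    phi = 1.0
    theta = 0.0
    running_sum = 0.0
    for n in range(n_steps):
        if flips[n]:
            phi = -phi
        # Illustrative f (an assumption): its steady-state mean is
        # -(theta - theta_star), so the root of f-bar is theta_star.
        f_val = -(1.0 + 0.5 * phi) * (theta - theta_star) + phi
        theta += alpha * f_val
        running_sum += theta
    return theta, running_sum / n_steps


# Compare a larger and a smaller fixed step-size on the same noise sequence.
for alpha in (0.05, 0.01):
    last, pr = run_sa(alpha, 200_000)
    print(f"alpha={alpha}: last iterate={last:.4f}, PR average={pr:.4f}")
```

The final iterate keeps fluctuating around the root at a scale set by $α$, while the averaged estimate is much less variable but inherits the step-size-dependent offset.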

Paper Structure (20 sections, 36 theorems, 207 equations, 8 figures)

Introduction
Main Results
Preliminaries
Ergodicity and Lyapunov exponents
Linear stochastic approximation
Theory without convergence
Stochastic gradient descent
Examples of applications
Optimization
Reinforcement learning
Impact of statistical memory
Conclusions and further work
Appendices
Mean flow stability theory
Markov chain bounds
...and 5 more sections

Key Result

Proposition 2.1

Suppose that $\bar{f}$ is Lipschitz continuous. If (A3${}^\circ$) holds then there exists a solution to (A3${}^\bullet$). Conversely, if (A3${}^\bullet$) holds, and the scaled vector field $\bar{f}_{\infty}(\theta)$ exists for each $\theta \in \mathbb{R}^d$,

Figures (8)

Figure 1: Multiple local minima for the six-hump camel-back function
Figure 2: SGD for the camel-back function. Large exploration and small gain: $\sigma^2_W = 400$ and $\alpha=0.02$.
Figure 3: SGD for the camel-back function. Moderate exploration and large gain: $\sigma^2_W = 10$ and $\alpha=0.1$.
Figure 4: SGD for the modified Styblinski-Tang function using moderate exploration and small gain.
Figure 5: Performance of TD($\lambda$) learning with $\lambda=0$. (a) $L_2$ norm of estimation error for $θ^{\text{PR}}_N$ with the fixed step-size algorithm as a function of $\alpha$. (b) Histogram of estimation error for vanishing and constant step-size algorithms for each dimension of $θ^{\text{PR}}_N$.
...and 3 more figures
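
For readers unfamiliar with the test function named in Figures 1 to 3, the following is a minimal sketch that assumes the paper uses the standard textbook form of the six-hump camel-back function. The deterministic fixed-step gradient descent here is only a device for locating minima from different starting points; it is not the paper's SGD with exploration noise, and the names `camel_back`, `camel_back_grad`, and `descend` are hypothetical.

```python
def camel_back(x, y):
    """Standard six-hump camel-back function (form assumed, not from the paper)."""
    return (4.0 - 2.1 * x**2 + x**4 / 3.0) * x**2 + x * y + (-4.0 + 4.0 * y**2) * y**2


def camel_back_grad(x, y):
    """Analytic gradient of camel_back."""
    dx = 8.0 * x - 8.4 * x**3 + 2.0 * x**5 + y
    dy = x - 8.0 * y + 16.0 * y**3
    return dx, dy


def descend(x, y, alpha=0.02, n_steps=5000):
    """Noise-free gradient descent with a fixed step-size alpha."""
    for _ in range(n_steps):
        dx, dy = camel_back_grad(x, y)
        x -= alpha * dx
        y -= alpha * dy
    return x, y


# Two starting points that end in different basins of attraction.
for start in ((0.1, -0.7), (1.7, -0.8)):
    x, y = descend(*start)
    print(f"start={start} -> ({x:.4f}, {y:.4f}), value={camel_back(x, y):.4f}")
```

Without noise, the end point depends entirely on the starting basin, which is the difficulty that the exploration noise in the figures is meant to address.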

Theorems & Definitions (36)

Proposition 2.1
Proposition 2.2
Lemma 2.3
Theorem 2.1
Corollary 2.4
Lemma 2.5
Theorem 2.2
Corollary 2.6
Theorem 3.1
Corollary 3.1
...and 26 more
