The case for and against fixed step-size: Stochastic approximation algorithms in optimization and machine learning
Caio Kalil Lauand, Ioannis Kontoyiannis, Sean Meyn
TL;DR
This paper analyzes stochastic approximation (SA) with a fixed step-size in a Markov-noise setting, addressing the root-finding problem $\bar{f}(θ^*)=0$ where $\bar{f}(θ)=\mathbb{E}[f(θ,Φ)]$. It establishes geometric ergodicity of the joint process, derives moment bounds and a bias-variance trade-off for constant gain, and characterizes the performance of Polyak–Ruppert averaging, including conditions under which the asymptotic covariance is approximately minimal. The work also provides a refined theory for linear SA, showing exponential convergence and explicit bias/variance decompositions; it further presents a theory-without-convergence extension for additive-noise models and CLTs for empirical targets. SGD and TD-learning examples illustrate the practical implications: constant gain can expedite global search in optimization with multiple local minima, while vanishing gain better controls bias in reinforcement learning. The paper closes by outlining avenues for bias control and for understanding memory effects in stochastic optimization.
Abstract
Theory and application of stochastic approximation (SA) have become increasingly relevant due in part to applications in optimization and reinforcement learning. This paper takes a new look at SA with constant step-size $α>0$, defined by the recursion $$θ_{n+1} = θ_{n}+ αf(θ_n,Φ_{n+1})$$ in which $θ_n\in\mathbb{R}^d$ and $\{Φ_{n}\}$ is a Markov chain. The goal is to approximately solve the root-finding problem $\bar{f}(θ^*)=0$, where $\bar{f}(θ)=\mathbb{E}[f(θ,Φ)]$ and $Φ$ has the steady-state distribution of $\{Φ_{n}\}$. The following conclusions are obtained under an ergodicity assumption on the Markov chain, compatible assumptions on $f$, and for $α>0$ sufficiently small: $\textbf{1.}$ The pair process $\{(θ_n,Φ_n)\}$ is geometrically ergodic in a topological sense. $\textbf{2.}$ For every $1\le p\le 4$, there is a constant $b_p$ such that $\limsup_{n\to\infty}\mathbb{E}[\|θ_n-θ^*\|^p]\le b_p α^{p/2}$ for each initial condition. $\textbf{3.}$ The Polyak-Ruppert-style averaged estimates $θ^{\text{PR}}_n=n^{-1}\sum_{k=1}^{n}θ_k$ converge almost surely and in mean square to a limit $θ^{\text{PR}}_\infty$, which satisfies $θ^{\text{PR}}_\infty=θ^*+α\bar{Υ}^*+O(α^2)$ for an identified non-random $\bar{Υ}^*\in\mathbb{R}^d$. Moreover, the covariance is approximately optimal: the limiting covariance matrix of $θ^{\text{PR}}_n$ is approximately minimal in a matricial sense. The two main take-aways for practitioners are application-dependent. It is argued that, in applications to optimization, constant gain algorithms may be preferable even when the objective has multiple local minima, while a vanishing gain algorithm is preferable in applications to reinforcement learning due to the presence of bias.
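The recursion and the averaged estimate in the abstract can be illustrated with a minimal simulation sketch. Everything in it is an assumption chosen for illustration and none of it is taken from the paper: a scalar model $f(θ,Φ) = -(1+0.5Φ)θ + 1 + Φ$, a two-state Markov chain $Φ_n\in\{-1,+1\}$ that flips sign with probability `flip_prob`, and the function name `run_sa`. The chain's steady-state distribution is uniform on $\{-1,+1\}$, so $\bar{f}(θ)=1-θ$ and $θ^*=1$. The sketch returns the averaged estimate $θ^{\text{PR}}_n$ together with the empirical mean-square error of the raw iterates.

```python
import random


def run_sa(alpha, n_steps=100_000, flip_prob=0.2, seed=0):
    """Constant-gain SA on a toy scalar model with Markov noise.

    Hypothetical model (an illustration, not the paper's example):
        f(theta, phi) = -(1 + 0.5 * phi) * theta + 1 + phi,  phi in {-1, +1}
    With phi uniform in steady state, f_bar(theta) = 1 - theta, so theta* = 1.

    Returns (pr_average, mean_square_error) over all n_steps iterates.
    """
    rng = random.Random(seed)
    theta_star = 1.0
    theta, phi = 0.0, 1          # arbitrary initial condition
    total, sq_err = 0.0, 0.0

    for _ in range(n_steps):
        # Markov chain step: flip the sign of phi with probability flip_prob.
        if rng.random() < flip_prob:
            phi = -phi
        # SA recursion: theta_{n+1} = theta_n + alpha * f(theta_n, phi_{n+1}).
        theta += alpha * (-(1.0 + 0.5 * phi) * theta + 1.0 + phi)
        # Accumulate the running sum for the averaged estimate and the MSE.
        total += theta
        sq_err += (theta - theta_star) ** 2

    return total / n_steps, sq_err / n_steps


if __name__ == "__main__":
    for alpha in (0.05, 0.2):
        pr, mse = run_sa(alpha)
        print(f"alpha={alpha}: PR average={pr:.4f}, MSE of iterates={mse:.4f}")
```

In this toy setting the averaged estimate should settle near $θ^*=1$ with a small offset that grows with $α$, and the mean-square error of the raw iterates should also grow with $α$. Both behaviors are in line with conclusions 2 and 3 of the abstract, though a single simulated run is only suggestive.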
