Two-Timescale Linear Stochastic Approximation: Constant Stepsizes Go a Long Way
Jeongyeol Kwon, Luke Dotson, Yudong Chen, Qiaomin Xie
TL;DR
This work analyzes two-timescale stochastic approximation (TTSA) with constant stepsizes under Markovian noise, proving that the joint process converges to a unique stationary distribution in Wasserstein distance and deriving explicit, non-asymptotic bias-variance characterizations for both iterates. Importantly, it shows that the biases of the slower and faster iterates are linear in the stepsizes, while the variance of each iterate scales only with its own stepsize, without requiring restrictive conditions such as $β^2 \ll α$ or extra dimension-dependent constants. The authors further apply tail-averaging and Richardson-Romberg extrapolation to reduce variance and bias, achieving an MSE bound of $O(β^4 + 1/t)$ for both iterates. These results enable practical variance reduction techniques in TTSA and provide a fine-grained understanding of constant-stepsize TTSA performance with Markovian noise, with potential impact on reinforcement learning algorithms and stochastic bilevel optimization.
Abstract
Previous studies on two-timescale stochastic approximation (SA) have mainly focused on bounding mean-squared errors under diminishing stepsize schemes. In this work, we investigate constant stepsize schemes through the lens of Markov processes, proving that the iterates of both timescales converge to a unique joint stationary distribution in the Wasserstein metric. We derive explicit geometric and non-asymptotic convergence rates, as well as the variance and bias introduced by constant stepsizes in the presence of Markovian noise. Specifically, with two constant stepsizes $α < β$, we show that the biases scale linearly with both stepsizes as $Θ(α)+Θ(β)$ up to higher-order terms, while the variance of the slower iterate (resp., faster iterate) scales only with its own stepsize as $O(α)$ (resp., $O(β)$). Unlike previous work, our results require neither additional assumptions such as $β^2 \ll α$ nor extra dependence on dimensions. These fine-grained characterizations allow tail-averaging and extrapolation techniques to reduce variance and bias, improving the mean-squared error bound to $O(β^4 + \frac{1}{t})$ for both iterates.
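
To make the setting concrete, the following is a minimal sketch of constant-stepsize two-timescale linear SA with tail-averaging and a Richardson-Romberg style extrapolation. It is an illustration under stated assumptions, not the authors' implementation: the scalar toy problem, the two-state Markov noise, the coefficient values, and the function names `run_ttsa` and `richardson_romberg` are all hypothetical choices made here. The extrapolation shown is the standard two-run combination at stepsizes $(α, β)$ and $(2α, 2β)$, which cancels a bias that is linear in the stepsizes; the exact scheme in the paper may differ.

```python
import random


def run_ttsa(alpha, beta, num_steps, seed=0, flip_prob=0.3):
    """Constant-stepsize two-timescale linear SA on a toy scalar problem.

    Assumed (hypothetical) mean system: 2x + y = 1 and x + 2y = 1,
    whose solution is x = y = 1/3. The slow iterate x uses stepsize
    alpha and the fast iterate y uses stepsize beta, with alpha < beta.
    Returns the tail averages of x and y over the last half of the run.
    """
    rng = random.Random(seed)
    z = 1.0  # two-state Markov chain on {-1, +1} driving the noise
    x, y = 0.0, 0.0
    burn_in = num_steps // 2
    x_sum, y_sum = 0.0, 0.0
    for t in range(num_steps):
        # Coefficients perturbed by the current Markov state.
        a11 = 2.0 + 0.5 * z  # multiplicative noise in the slow update
        b1 = 1.0 + z         # additive noise in the slow update
        b2 = 1.0 - z         # additive noise in the fast update
        # Both iterates are updated simultaneously from the old (x, y).
        x_new = x + alpha * (b1 - a11 * x - 1.0 * y)
        y_new = y + beta * (b2 - 1.0 * x - 2.0 * y)
        x, y = x_new, y_new
        # Tail-averaging: accumulate only after the burn-in period.
        if t >= burn_in:
            x_sum += x
            y_sum += y
        # Markov transition: flip the state with probability flip_prob.
        if rng.random() < flip_prob:
            z = -z
    n = num_steps - burn_in
    return x_sum / n, y_sum / n


def richardson_romberg(alpha, beta, num_steps, seed=0):
    """Combine runs at (alpha, beta) and (2*alpha, 2*beta).

    If the bias is c1*alpha + c2*beta plus higher-order terms, then
    2 * estimate(alpha, beta) - estimate(2*alpha, 2*beta) cancels the
    first-order part.
    """
    x1, y1 = run_ttsa(alpha, beta, num_steps, seed)
    x2, y2 = run_ttsa(2 * alpha, 2 * beta, num_steps, seed)
    return 2 * x1 - x2, 2 * y1 - y2


if __name__ == "__main__":
    print("tail-averaged:", run_ttsa(0.01, 0.05, 200_000, seed=1))
    print("extrapolated: ", richardson_romberg(0.01, 0.05, 200_000, seed=1))
```

In this toy problem both estimates should land close to the solution $1/3$; the sketch is meant to show the mechanics of the two techniques and does not attempt to verify the rates stated in the abstract.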
