Two-Timescale Linear Stochastic Approximation: Constant Stepsizes Go a Long Way

Jeongyeol Kwon, Luke Dotson, Yudong Chen, Qiaomin Xie

TL;DR

This work analyzes two-timescale stochastic approximation (TTSA) with constant stepsizes under Markovian noise. It proves that the joint process converges to a unique stationary distribution in Wasserstein distance and derives explicit, non-asymptotic bias-variance characterizations for both iterates. In particular, the biases of the slower and faster iterates are linear in the stepsizes, while the variance of each iterate scales with its own stepsize, without requiring restrictive conditions such as $\beta^2 \leq \alpha$ or dimension-dependent constants. The authors further apply tail-averaging and Richardson-Romberg extrapolation to reduce variance and bias, achieving an MSE bound of $O_{\text{P}}(\beta^4 + 1/t)$ for both iterates. These results enable practical variance reduction techniques in TTSA and give a fine-grained picture of constant-stepsize TTSA performance under Markovian noise, with potential impact on reinforcement learning algorithms and stochastic bilevel optimization.

Abstract

Previous studies on two-timescale stochastic approximation (SA) mainly focused on bounding mean-squared errors under diminishing stepsize schemes. In this work, we investigate constant stepsize schemes through the lens of Markov processes, proving that the iterates of both timescales converge to a unique joint stationary distribution in Wasserstein metric. We derive explicit geometric and non-asymptotic convergence rates, as well as the variance and bias introduced by constant stepsizes in the presence of Markovian noise. Specifically, with two constant stepsizes $\alpha < \beta$, we show that the biases scale linearly with both stepsizes as $\Theta(\alpha) + \Theta(\beta)$ up to higher-order terms, while the variance of the slower iterate (resp., faster iterate) scales only with its own stepsize as $O(\alpha)$ (resp., $O(\beta)$). Unlike previous work, our results require neither additional assumptions such as $\beta^2 \ll \alpha$ nor extra dependence on dimensions. These fine-grained characterizations allow tail-averaging and extrapolation techniques to reduce variance and bias, improving the mean-squared error bound to $O(\beta^4 + \frac{1}{t})$ for both iterates.
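
As a rough illustration of the setting described above, the following Python sketch runs a scalar linear two-timescale recursion with constant stepsizes $\alpha < \beta$, tail-averages the iterates, and then forms a Richardson-Romberg combination from two runs. Everything concrete here is an assumption made for illustration and is not taken from the paper: the coefficients, the two-state Markov chain, the noise magnitudes, the choice to double both stepsizes in the second run, and the function names `run_ttsa` and `richardson_romberg`.

```python
import random

# Hypothetical scalar problem; all numbers are illustrative assumptions.
# Mean system: A11*x + A12*y = B1 and A21*x + A22*y = B2,
# whose solution is x* = y* = 2/3.
A11, A12, A21, A22 = 1.0, 0.5, 0.5, 1.0
B1, B2 = 1.0, 1.0
X_STAR = Y_STAR = 2.0 / 3.0


def run_ttsa(alpha, beta, steps, seed=0, stay_prob=0.9):
    """Run constant-stepsize linear TTSA and return tail-averaged (x, y).

    x is the slower iterate (stepsize alpha), y the faster one (stepsize beta).
    """
    rng = random.Random(seed)
    z = 1.0                 # state of a symmetric two-state Markov chain on {+1, -1}
    x, y = 0.0, 0.0
    burn_in = steps // 2    # discard the first half, average the second half
    sum_x = sum_y = 0.0
    for t in range(steps):
        # Markovian noise perturbs the diagonal coefficients and the targets;
        # the perturbations average to zero under the stationary distribution.
        a11 = A11 + 0.3 * z
        a22 = A22 + 0.3 * z
        b1 = B1 + 0.5 * z
        b2 = B2 + 0.5 * z
        # Both iterates are updated from the same (x, y) and the same noise state.
        x_new = x - alpha * (a11 * x + A12 * y - b1)
        y_new = y - beta * (A21 * x + a22 * y - b2)
        x, y = x_new, y_new
        if t >= burn_in:
            sum_x += x
            sum_y += y
        # Advance the Markov chain: flip the state with probability 1 - stay_prob.
        if rng.random() > stay_prob:
            z = -z
    n = steps - burn_in
    return sum_x / n, sum_y / n


def richardson_romberg(alpha, beta, steps, seed=0):
    """Combine runs at (alpha, beta) and (2*alpha, 2*beta).

    If the tail-averaged bias is linear in the stepsizes to first order,
    the combination 2 * (small-step run) - (large-step run) cancels that term.
    """
    x1, y1 = run_ttsa(alpha, beta, steps, seed)
    x2, y2 = run_ttsa(2 * alpha, 2 * beta, steps, seed)
    return 2 * x1 - x2, 2 * y1 - y2


if __name__ == "__main__":
    x_ta, y_ta = run_ttsa(0.01, 0.05, 100_000)
    x_rr, y_rr = richardson_romberg(0.01, 0.05, 100_000)
    print("tail-averaged error:", abs(x_ta - X_STAR), abs(y_ta - Y_STAR))
    print("RR-extrapolated error:", abs(x_rr - X_STAR), abs(y_rr - Y_STAR))
```

Comparing the two printed error pairs across several stepsizes and seeds gives a feel for how the first-order bias behaves in this toy problem. It is not a reproduction of the paper's experiments.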

Paper Structure

This paper contains 56 sections, 23 theorems, 191 equations, 4 figures.

Key Result

Lemma 3.1

Let $\bar{x}_t = x_t - x^*$ and $\bar{y}_t = y_t - y^*(x_t)$. Then the basic TTSA recursion (eq:basic_ttsa_equation) can be rewritten as:

Figures (4)

  • Figure 1: Bias (top) and variance (bottom) versus $\beta$ at different $\alpha$ for the slower iterate $x_t$.
  • Figure 2: Bias (top) and variance (bottom) versus $\beta$ at different $\alpha$ for the faster iterate $y_t$.
  • Figure 3: Comparison of Tail-Averaging (TA) and Richardson-Romberg (RR) extrapolation in $\beta$.
  • Figure 4: Comparison of Tail-Averaging (TA), RR extrapolation in $\beta$, and RR extrapolation in both $\beta$ and $\alpha$.

Theorems & Definitions (26)

  • Definition 1
  • Lemma 3.1
  • Theorem 3.2
  • Theorem 3.3
  • Corollary 3.4
  • Theorem 3.5
  • Corollary 3.6
  • Remark 1
  • Remark 2
  • Lemma A.1
  • ...and 16 more