Implicit Updates for Average-Reward Temporal Difference Learning

Hwanwoo Kim, Dongkyu Derek Cho, Eric Laber

TL;DR

This work tackles the known step-size sensitivity of average-reward TD($\lambda$) by introducing average-reward implicit TD($\lambda$), whose implicit fixed-point update automatically shrinks the effective step-size. The authors establish non-asymptotic finite-time error bounds under weaker step-size conditions than prior analyses, while preserving the per-iteration complexity of standard average-reward TD($\lambda$). They also include a projection step for stabilization, and report empirical evidence of improved stability and efficiency on policy evaluation and control tasks. The results indicate that implicit updates make average-reward learning more reliable across a wider range of step-sizes.

Abstract

Temporal difference (TD) learning is a cornerstone of reinforcement learning. In the average-reward setting, standard TD($\lambda$) is highly sensitive to the choice of step-size and thus requires careful tuning to maintain numerical stability. We introduce average-reward implicit TD($\lambda$), which employs an implicit fixed-point update to provide data-adaptive stabilization while preserving the per-iteration computational complexity of standard average-reward TD($\lambda$). In contrast to prior finite-time analyses of average-reward TD($\lambda$), which impose restrictive step-size conditions, we establish finite-time error bounds for the implicit variant under substantially weaker step-size requirements. Empirically, average-reward implicit TD($\lambda$) operates reliably over a much broader range of step-sizes and exhibits markedly improved numerical stability. This enables more efficient policy evaluation and policy learning, highlighting its effectiveness as a robust alternative to average-reward TD($\lambda$).
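
To illustrate how an implicit fixed-point update can shrink the effective step-size, the sketch below uses a simple linear least-squares recursion. It is not the paper's average-reward TD($\lambda$) update, which is not reproduced on this page. All names (`explicit_step`, `implicit_step`, `run`) and the toy data are assumptions made for illustration. The implicit step solves $\theta' = \theta + \alpha (y - x^\top \theta') x$ in closed form, which is the same as an explicit step with step-size $\alpha / (1 + \alpha \|x\|^2)$.

```python
# Illustration only: implicit vs. explicit updates on a toy linear model.
# This is NOT the paper's average-reward implicit TD(lambda) recursion.

def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def explicit_step(theta, x, y, alpha):
    """Standard update: theta_new = theta + alpha * (y - x.theta) * x."""
    err = y - dot(x, theta)
    return [t + alpha * err * xi for t, xi in zip(theta, x)]

def implicit_step(theta, x, y, alpha):
    """Implicit update: theta_new = theta + alpha * (y - x.theta_new) * x.

    Solving this fixed-point equation gives an explicit step whose
    step-size is shrunk to alpha / (1 + alpha * ||x||^2).
    """
    effective = alpha / (1.0 + alpha * dot(x, x))
    err = y - dot(x, theta)
    return [t + effective * err * xi for t, xi in zip(theta, x)]

def run(step, alpha, n_iters=50):
    """Repeat one update rule on a fixed (hypothetical) sample; return |error|."""
    theta = [0.0, 0.0]
    x, y = [2.0, 1.0], 1.0
    for _ in range(n_iters):
        theta = step(theta, x, y, alpha)
    return abs(y - dot(x, theta))

if __name__ == "__main__":
    print("explicit, alpha=1.0:", run(explicit_step, 1.0))
    print("implicit, alpha=1.0:", run(implicit_step, 1.0))
```

In this toy example, $\alpha \|x\|^2 = 5$, so the explicit recursion multiplies the error by $-4$ at every step and diverges, whereas the implicit recursion contracts it by a factor of $1/6$. This is the kind of data-adaptive stabilization the abstract describes, shown here in a much simpler setting.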

Paper Structure

This paper contains 41 sections, 24 theorems, 150 equations, 7 figures, 1 algorithm.

Key Result

Lemma 3.1

Average-reward implicit TD($\lambda$) updates given in eqn:param_update_im2 and eqn:param_update_im1 can be expressed as

Figures (7)

  • Figure 1: Sensitivity of average-reward TD($\lambda$) to the choice of step-size with exponential weighting parameter $\lambda = 0.25$ and step-size ratio $c_\alpha = 1.0$. The solid line denotes the mean, and the shaded region indicates the 95% confidence interval.
  • Figure 2: MRP experiment under constant step-size, with exponential weighting parameter and step-size ratio set to $(\lambda, c_{\alpha}) = (0.25, 1.0)$. The solid line represents the mean, and the shaded region denotes the 95% confidence interval. (Left) Loss value from step-size 0.1 to 3.0. (Right) Loss value over iterations with initial step-size $\beta_0 = 1.0$.
  • Figure 3: Boyan experiment with exponential weighting parameter and step-size ratio set to $(\lambda, c_{\alpha}) = (0.25, 1.0)$ under decaying step-size schedule $\beta_t = \beta_0 / (t+1)^{0.99}$. Solid lines denote the mean, and shaded regions represent 95% confidence intervals. (Left) Loss value with initial step-sizes ranging from 0.1 to 3.0. (Right) Loss value over iterations with initial step-size $\beta_0 = 1.5$.
  • Figure 4: Control experiment with exponential weighting parameter and step-size ratio parameter $(\lambda, c_{\alpha}) = (0.25, 1.0)$, under the decaying step-size schedule $\beta_t = \beta_0 / (t+400)^{0.99}$. Initial step-size ranges from 0.25 to 1.5. Solid lines denote the mean, and shaded regions represent 95% confidence intervals.
  • Figure 5: MRP experiment results under decaying step-size schedule $\beta_t = \beta_0 / (t+1)^{0.99}$, with exponential weighting parameter and step-size ratio set to $(\lambda, c_{\alpha}) = (0.25, 1.0)$. Solid lines denote the mean, and shaded regions represent 95% confidence intervals. (Left) Loss value for initial step-sizes from 0.1 to 3.0. (Right) Full trajectory of the loss value with initial step-size $\beta_0 = 1.8$.
  • ...and 2 more figures
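
The decaying step-size schedules quoted in the captions above, $\beta_t = \beta_0 / (t+1)^{0.99}$ and $\beta_t = \beta_0 / (t+400)^{0.99}$, can be sketched as a small helper. The function names and the `offset` and `power` parameters are hypothetical. The relation $\alpha_t = c_\alpha \beta_t$ is an assumption about how the step-size ratio $c_\alpha$ is used, since the captions do not define it.

```python
def beta_schedule(beta0, t, offset=1, power=0.99):
    """Decaying step-size beta_t = beta0 / (t + offset)**power.

    offset=1 matches the MRP and Boyan captions; offset=400 matches the
    control caption. Parameter names are illustrative, not from the paper.
    """
    return beta0 / (t + offset) ** power

def alpha_schedule(beta0, t, c_alpha=1.0, offset=1, power=0.99):
    """ASSUMPTION: the second step-size is c_alpha times beta_t."""
    return c_alpha * beta_schedule(beta0, t, offset=offset, power=power)

if __name__ == "__main__":
    # Step-sizes at a few iterations for the Boyan setting (beta0 = 1.5).
    for t in (0, 10, 100, 1000):
        print(t, beta_schedule(1.5, t))
```

With `offset=400`, the schedule starts from a much smaller value (`beta0 / 400**0.99`) and decays more gently over early iterations than with `offset=1`.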

Theorems & Definitions (27)

  • Lemma 3.1
  • Lemma 4.1: Lemma 2 of Zhang et al. (2021)
  • Definition 4.2: Mixing Time
  • Remark 4.3
  • Theorem 4.4
  • Theorem 4.5
  • Remark 4.6
  • Lemma B.1
  • Theorem B.2
  • Theorem B.3
  • ...and 17 more