Implicit Updates for Average-Reward Temporal Difference Learning
Hwanwoo Kim, Dongkyu Derek Cho, Eric Laber
TL;DR
This work addresses the well-known step-size sensitivity of average-reward TD($\lambda$) by introducing average-reward implicit TD($\lambda$), whose implicit fixed-point update adaptively shrinks the effective step-size. The authors establish finite-time error bounds under weaker step-size conditions than prior analyses, while keeping the same per-iteration computational complexity as the standard method. They also include a projection step for stabilization and report experiments on policy evaluation and control tasks in which the implicit variant remains stable over a much broader range of step-sizes. The results suggest that implicit updates make average-reward learning more reliable and less dependent on step-size tuning.
Abstract
Temporal difference (TD) learning is a cornerstone of reinforcement learning. In the average-reward setting, standard TD($\lambda$) is highly sensitive to the choice of step-size and thus requires careful tuning to maintain numerical stability. We introduce average-reward implicit TD($\lambda$), which employs an implicit fixed-point update to provide data-adaptive stabilization while preserving the per-iteration computational complexity of standard average-reward TD($\lambda$). In contrast to prior finite-time analyses of average-reward TD($\lambda$), which impose restrictive step-size conditions, we establish finite-time error bounds for the implicit variant under substantially weaker step-size requirements. Empirically, average-reward implicit TD($\lambda$) operates reliably over a much broader range of step-sizes and exhibits markedly improved numerical stability. This enables more efficient policy evaluation and policy learning, highlighting its effectiveness as a robust alternative to average-reward TD($\lambda$).
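To make the idea of an implicit fixed-point update concrete, the following is a minimal sketch and not the authors' algorithm. It assumes the simplest case, $\lambda = 0$ with linear features, and omits both eligibility traces and the projection step. Under that assumption, the implicit equation $\theta^{+} = \theta + \alpha\,(r - \rho + \phi'^{\top}\theta - \phi^{\top}\theta^{+})\,\phi$ has a closed-form solution that is an ordinary TD step with the step-size shrunk to $\alpha / (1 + \alpha\,\|\phi\|^2)$, so the per-iteration cost is unchanged. The function name, the argument names, and the running-average rule for $\rho$ with rate `beta` are all hypothetical choices made for illustration.

```python
def implicit_avg_reward_td0_step(theta, rho, phi, phi_next, reward, alpha, beta):
    """One illustrative implicit TD(0) step in the average-reward setting.

    Assumptions (not taken from the paper): linear value estimate phi . theta,
    lambda = 0, and a simple running-average update for rho with rate beta.
    """
    v = sum(t * p for t, p in zip(theta, phi))
    v_next = sum(t * p for t, p in zip(theta, phi_next))
    # Average-reward TD error evaluated at the current parameters.
    delta = reward - rho + v_next - v

    # Closed-form solution of the implicit equation
    #   theta_new = theta + alpha * (reward - rho + v_next - phi . theta_new) * phi
    # which is a standard TD step with a data-dependent, shrunken step-size.
    sq_norm = sum(p * p for p in phi)
    eff_alpha = alpha / (1.0 + alpha * sq_norm)
    theta_new = [t + eff_alpha * delta * p for t, p in zip(theta, phi)]

    # Assumed running-average estimate of the average reward.
    rho_new = rho + beta * (reward - rho)
    return theta_new, rho_new, eff_alpha


# Even with a very large nominal step-size, the effective step-size stays
# below 1 / ||phi||^2, which is the stabilizing effect described above.
theta_new, rho_new, eff_alpha = implicit_avg_reward_td0_step(
    theta=[0.0, 0.0], rho=0.0, phi=[1.0, 2.0], phi_next=[0.0, 1.0],
    reward=1.0, alpha=100.0, beta=0.1,
)
```

In this sketch the shrinkage needs only one extra inner product per step, which is consistent with the claim that the implicit variant keeps the per-iteration complexity of the standard update.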
