Heavy-Ball Momentum Accelerated Actor-Critic With Function Approximation
Yanjie Dong, Haijun Zhang, Gang Wang, Shisheng Cui, Xiping Hu
TL;DR
The paper addresses variance and slow convergence in policy-gradient methods for continuous-state RL by introducing a heavy-ball momentum term into the critic update of an actor-critic framework (HB-A2C). It uses multi-step bootstrapping and a two-timescale online scheme to update actor and critic concurrently, with a linear function approximator for the value function. A new analytical framework bounds gradient bias and optimality drift under Markovian noise, showing that the unified actor-critic recursions converge to an $\epsilon$-stationary point at a rate of ${\cal O}(1/\sqrt{K})$ (with additional ${\cal O}(1/K)$ terms) when the learning rates scale as $\alpha=\Theta(1/\sqrt{K})$ and $\beta=c_5\alpha$. The results reveal how the momentum factor and trajectory length influence convergence, and they provide finite-time guarantees without requiring decaying variance. This work thus offers a principled, momentum-accelerated approach for RL with function approximation in online, Markovian settings, potentially improving data efficiency and convergence speed in continuous control tasks.
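As a hedged reading of how the stated rate translates into an iteration count, suppose the stationarity measure after $K$ iterations is bounded by $C_1/\sqrt{K} + C_2/K$. The constants $C_1$ and $C_2$ are placeholders introduced here for illustration and are not taken from the paper. Requiring each term to be at most $\epsilon/2$ gives a sufficient number of iterations:

$$
\frac{C_1}{\sqrt{K}} \le \frac{\epsilon}{2} \iff K \ge \frac{4C_1^2}{\epsilon^2},
\qquad
\frac{C_2}{K} \le \frac{\epsilon}{2} \iff K \ge \frac{2C_2}{\epsilon},
\qquad\text{so}\qquad
K \ge \max\left\{\frac{4C_1^2}{\epsilon^2},\ \frac{2C_2}{\epsilon}\right\} = {\cal O}(\epsilon^{-2}).
$$

For small $\epsilon$ the first term dominates, which is why the ${\cal O}(1/K)$ terms do not change the overall order.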
Abstract
By using a parametric value function in place of Monte-Carlo rollouts for value estimation, actor-critic (AC) algorithms can reduce the variance of the stochastic policy gradient and thereby improve the convergence rate. While existing works mainly focus on analyzing the convergence rate of AC algorithms under Markovian noise, the impact of momentum on AC algorithms remains largely unexplored. In this work, we first propose a heavy-ball momentum based advantage actor-critic (HB-A2C) algorithm by integrating heavy-ball momentum into the critic recursion, where the critic is parameterized by a linear function. When the sample trajectory follows a Markov decision process, we quantitatively certify the acceleration capability of the proposed HB-A2C algorithm. Our theoretical results demonstrate that the proposed HB-A2C finds an $\epsilon$-approximate stationary point within ${\cal O}(\epsilon^{-2})$ iterations for reinforcement learning tasks with Markovian noise. Moreover, we also reveal the dependence of the learning rates on the length of the sample trajectory. By carefully selecting the momentum factor of the critic recursion, the proposed HB-A2C can balance the errors introduced by the initialization and the stochastic approximation.
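To make the idea of a heavy-ball critic recursion concrete, the following is a minimal sketch, not the paper's exact update. It assumes a linear value function $V(s) = \phi(s)^\top \omega$, an $n$-step bootstrapped TD target, and the textbook heavy-ball form $\omega_{k+1} = \omega_k + \beta g_k + \eta(\omega_k - \omega_{k-1})$. The function names, the symbol $\eta$ for the momentum factor, and the toy features and rewards are all assumptions made for illustration; the paper's recursion may scale or arrange these terms differently.

```python
import numpy as np


def n_step_td_direction(omega, feats, rewards, gamma):
    """Semi-gradient n-step TD direction for a linear critic (illustrative).

    feats:   array of shape (n + 1, d), features of states s_0, ..., s_n
    rewards: array of shape (n,), rewards r_0, ..., r_{n-1}
    """
    n = len(rewards)
    discounts = gamma ** np.arange(n)
    # n-step bootstrapped target: discounted rewards plus discounted tail value.
    target = discounts @ rewards + gamma ** n * (feats[n] @ omega)
    td_error = target - feats[0] @ omega
    return td_error * feats[0]


def heavy_ball_critic_step(omega, omega_prev, direction, beta, eta):
    """Textbook heavy-ball step (assumed form, not taken from the paper).

    beta is the critic learning rate; eta in [0, 1) is the momentum factor.
    With eta = 0 this reduces to a plain TD step.
    """
    return omega + beta * direction + eta * (omega - omega_prev)


# Toy usage on a hypothetical two-step trajectory segment with one-hot features.
feats = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
rewards = np.array([1.0, 1.0])
gamma, beta, eta = 0.5, 0.1, 0.9

omega_prev = np.zeros(2)
omega = np.zeros(2)
for _ in range(2):
    g = n_step_td_direction(omega, feats, rewards, gamma)
    # The right-hand side is evaluated first, so omega_prev receives the old omega.
    omega, omega_prev = heavy_ball_critic_step(omega, omega_prev, g, beta, eta), omega
```

In this form, a larger momentum factor carries more of the previous displacement forward, which speeds up the decay of the initialization error but also accumulates more of the sampling noise in `g`. That is one way to read the trade-off between initialization and stochastic-approximation errors mentioned in the abstract.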
