Compatible Gradient Approximations for Actor-Critic Algorithms
Baturay Saglam, Dionysis Kalogerias
TL;DR
The paper tackles the incompatibility between deterministic policy gradient and function-approximated critics in deep RL by introducing oCPG, a zeroth-order gradient surrogate that estimates the action-value gradient through two-point perturbations in action space. It leverages a smoothed Q-function $Q^{\pi_\theta}_\mu$ and a parametric critic $Q_\psi$ to form the gradient estimate $\hat{\nabla}^{\mu,\psi}J(\theta)$, with theoretical bounds showing the gradient error can be made small by controlling the perturbation scale $\mu$ and the PRE $\varepsilon_{\mu,\psi}^{\pi_\theta}$. The approach integrates with off-policy TD3-style learning using two Q-networks and replay buffers, providing practical stability and bias-reduction benefits. Empirically, oCPG frequently matches or surpasses state-of-the-art methods on MuJoCo continuous-control tasks, including under imperfect environmental conditions, highlighting its robustness to gradient-approximation issues and its potential for broader application in model-free RL.
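For orientation, a smoothed action-value function and its two-point gradient estimate can be written as below. This is a sketch under assumptions: it uses standard Gaussian smoothing and a symmetric difference, and the symbols $u$, $N$, and $d$ (the action dimension) are introduced here for illustration. The paper's exact definitions, including the perturbation distribution and normalization, may differ.

$$
Q^{\pi_\theta}_\mu(s,a) = \mathbb{E}_{u \sim \mathcal{N}(0, I_d)}\big[\, Q^{\pi_\theta}(s, a + \mu u) \,\big],
\qquad
\nabla_a Q^{\pi_\theta}_\mu(s,a) = \mathbb{E}_{u \sim \mathcal{N}(0, I_d)}\left[ \frac{Q^{\pi_\theta}(s, a + \mu u) - Q^{\pi_\theta}(s, a - \mu u)}{2\mu}\, u \right]
$$

$$
\nabla_a Q^{\pi_\theta}_\mu(s,a) \approx \frac{1}{N} \sum_{i=1}^{N} \frac{Q_\psi(s, a + \mu u_i) - Q_\psi(s, a - \mu u_i)}{2\mu}\, u_i,
\qquad u_i \sim \mathcal{N}(0, I_d)
$$

The second expression replaces the true action-value function with the parametric critic $Q_\psi$ and the expectation with a sample average, so it needs only critic evaluations and never the critic's derivative with respect to the action.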
Abstract
Deterministic policy gradient algorithms are foundational for actor-critic methods in controlling continuous systems, yet they often encounter inaccuracies due to their dependence on the derivative of the critic's value estimates with respect to input actions. This reliance requires precise action-value gradient computations, a task that proves challenging under function approximation. We introduce an actor-critic algorithm that bypasses the need for such precision by employing a zeroth-order approximation of the action-value gradient through two-point stochastic gradient estimation within the action space. This approach provably and effectively addresses compatibility issues inherent in deterministic policy gradient schemes. Empirical results further demonstrate that our algorithm not only matches but frequently exceeds the performance of current state-of-the-art methods by a substantial margin.
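The two-point estimation mentioned in the abstract can be illustrated with a small numerical sketch. It is not the authors' implementation: the function `two_point_action_gradient`, the quadratic `toy_critic`, and all parameter values are hypothetical, and the symmetric-difference form with Gaussian directions is one common choice that may differ from the paper's estimator. The point is only that the action-gradient is recovered from critic evaluations alone, without differentiating the critic.

```python
import numpy as np


def two_point_action_gradient(q_fn, state, action, mu=0.05, num_samples=64, rng=None):
    """Zeroth-order estimate of the gradient of q_fn with respect to the action.

    Hypothetical helper: q_fn(state, action) returns a scalar value estimate.
    """
    rng = np.random.default_rng() if rng is None else rng
    action = np.asarray(action, dtype=float)
    # Random Gaussian perturbation directions, one row per sample.
    directions = rng.standard_normal((num_samples, action.shape[0]))
    # Evaluate the critic at the two perturbed actions for every direction.
    q_plus = np.array([q_fn(state, action + mu * d) for d in directions])
    q_minus = np.array([q_fn(state, action - mu * d) for d in directions])
    # Directional slope estimates, then average slope-weighted directions.
    slopes = (q_plus - q_minus) / (2.0 * mu)
    return (slopes[:, None] * directions).mean(axis=0)


# Toy critic (an assumption for illustration): a concave quadratic in the action.
TARGET = np.array([0.5, -0.2, 0.1])


def toy_critic(state, action):
    return -np.sum((action - TARGET) ** 2)


action = np.zeros(3)
estimate = two_point_action_gradient(
    toy_critic, None, action, mu=0.05, num_samples=20000, rng=np.random.default_rng(0)
)
exact = -2.0 * (action - TARGET)  # analytic gradient of the toy critic
print("estimate:", estimate)
print("exact:   ", exact)
```

In an actor update, an estimate like this would stand in for the analytic action-gradient of the critic and be chained with the policy's Jacobian with respect to its parameters. Averaging over more directions lowers the variance of the estimate at the cost of additional critic evaluations.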
