Compatible Gradient Approximations for Actor-Critic Algorithms

Baturay Saglam; Dionysis Kalogerias

TL;DR

The paper tackles the incompatibility between deterministic policy gradient and function-approximated critics in deep RL by introducing oCPG, a zeroth-order gradient surrogate that estimates the action-value gradient through two-point perturbations in action space. It leverages a smoothed Q-function $Q^{\pi_\theta}_\mu$ and a parametric critic $Q_\psi$ to form the gradient estimate $\hat{\nabla}^{\mu,\psi}J(\theta)$, with theoretical bounds showing the gradient error can be made small by controlling the perturbation scale $\mu$ and the PRE $\varepsilon_{\mu,\psi}^{\pi_\theta}$. The approach integrates with off-policy TD3-style learning using two Q-networks and replay buffers, providing practical stability and bias-reduction benefits. Empirically, oCPG frequently matches or surpasses state-of-the-art methods on MuJoCo continuous-control tasks, including under imperfect environmental conditions, highlighting its robustness to gradient-approximation issues and its potential for broader application in model-free RL.
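The two-point idea in the summary above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `toy_critic` is a hypothetical stand-in for a learned critic $Q_\psi$, and the symmetric-difference form of the estimator is an assumption (the paper's exact two-point form may differ). The critic is only evaluated at perturbed actions and is never differentiated.

```python
import numpy as np

def toy_critic(state, action):
    # Hypothetical stand-in for a learned critic Q_psi(s, a).
    # `action` may be a single vector or a batch with shape (n, action_dim).
    target = np.tanh(state)
    return -np.sum((action - target) ** 2, axis=-1)

def two_point_action_gradient(critic, state, action, mu=0.1, n_samples=20000, rng=None):
    # Zeroth-order estimate of dQ/da: average symmetric two-point differences
    # along Gaussian directions u ~ N(0, I) drawn in action space.
    rng = np.random.default_rng(0) if rng is None else rng
    u = rng.standard_normal((n_samples, action.shape[0]))
    q_plus = critic(state, action + mu * u)
    q_minus = critic(state, action - mu * u)
    directional = (q_plus - q_minus) / (2.0 * mu)
    return np.mean(directional[:, None] * u, axis=0)

state = np.array([0.3, -1.2, 0.8])
action = np.array([0.5, 0.0, -0.5])

estimate = two_point_action_gradient(toy_critic, state, action)
exact = -2.0 * (action - np.tanh(state))  # analytic gradient of the toy critic
print("zeroth-order estimate:", estimate)
print("analytic gradient:    ", exact)
```

For this quadratic toy critic the symmetric estimator is unbiased, so the two vectors agree up to Monte Carlo noise; for a general critic the estimate targets the gradient of the smoothed function $Q_\mu$ rather than of $Q$ itself.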

Abstract

Deterministic policy gradient algorithms are foundational for actor-critic methods in controlling continuous systems, yet they often encounter inaccuracies due to their dependence on the derivative of the critic's value estimates with respect to input actions. This reliance requires precise action-value gradient computations, a task that proves challenging under function approximation. We introduce an actor-critic algorithm that bypasses the need for such precision by employing a zeroth-order approximation of the action-value gradient through two-point stochastic gradient estimation within the action space. This approach provably and effectively addresses compatibility issues inherent in deterministic policy gradient schemes. Empirical results further demonstrate that our algorithm not only matches but frequently exceeds the performance of current state-of-the-art methods by a substantial extent.

Paper Structure (40 sections, 3 theorems, 22 equations, 3 figures, 2 tables, 1 algorithm)

Introduction
Related Work
Technical Preliminaries
Off-Policy Compatible Policy Gradient
Problem Statement
Provably Compatible Policy Gradient Approximations
Approximating the $Q$-function
Off-Policy Deep Reinforcement Learning
Clipped Double $Q$-learning [td3]
Off-Policy Learning
Experiments
Experimental Setup
Evaluation
Implementation and Hyperparameters
Selection of $\mu$
...and 25 more sections

Key Result

Proposition 1

Let $f: \mathbb{R}^p \rightarrow \mathbb{R}$ be a bounded function. For every $\mu > 0$, the smoothed function $f_\mu (x) \coloneqq \mathbb{E}_{\boldsymbol{u}}[f(x + \mu\boldsymbol{u})]$, $\boldsymbol{u} \sim \mathcal{N}(0, I_p)$, is well-defined and differentiable, and its gradient admits the representation [...]. Further, if $f$ is $G$-smooth (i.e., with $G$-Lipschitz gradients), it holds that [...]
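To make the smoothing construction in Proposition 1 concrete, the sketch below assumes the standard forward-difference representation from Nesterov and Spokoiny's Gaussian-smoothing analysis, $\nabla f_\mu(x) = \mathbb{E}_{\boldsymbol{u}}\left[\frac{f(x + \mu\boldsymbol{u}) - f(x)}{\mu}\boldsymbol{u}\right]$; the exact form and constants stated in the paper may differ. It uses the one-dimensional test function $f = \sin$ (chosen here purely for illustration), for which $\nabla f_\mu(x) = \cos(x)\,e^{-\mu^2/2}$ in closed form, and evaluates the Gaussian expectation by Gauss-Hermite quadrature instead of sampling so that the result is deterministic.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

def gaussian_expectation(g, degree=60):
    # E[g(u)] for u ~ N(0, 1) via Gauss-Hermite quadrature.
    # hermegauss uses the weight exp(-u^2 / 2), so normalise by sqrt(2*pi).
    nodes, weights = hermegauss(degree)
    return np.sum(weights * g(nodes)) / np.sqrt(2.0 * np.pi)

def smoothed_gradient(f, x, mu):
    # Assumed forward-difference representation of the gradient of f_mu at x.
    return gaussian_expectation(lambda u: (f(x + mu * u) - f(x)) / mu * u)

x = 0.7
errors = {}
for mu in (0.5, 0.1, 0.01):
    grad_mu = smoothed_gradient(np.sin, x, mu)
    closed_form = np.cos(x) * np.exp(-mu ** 2 / 2.0)
    errors[mu] = abs(grad_mu - np.cos(x))  # gap to the true gradient of f
    print(f"mu={mu}: smoothed grad={grad_mu:.6f}, "
          f"closed form={closed_form:.6f}, gap to f'={errors[mu]:.2e}")
```

The gap between the smoothed gradient and the true gradient shrinks as $\mu$ decreases, which is the qualitative behaviour the smoothness part of the proposition controls.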

Figures (3)

Figure 1: Learning curves for benchmark MuJoCo environments, averaged over 10 random seeds. Evaluation reward is calculated as the undiscounted sum of rewards collected by the agent during each evaluation episode. The shaded region indicates the 95% confidence interval of the mean performance.
Figure 2: Sketch of a function fitting problem containing a finite number of samples.
Figure 3: Learning curves for the benchmark MuJoCo environments, averaged over 10 random seeds, under imperfect environment conditions: (a) the agent observes the true reward with a probability of 0.5, otherwise zero; (b) rewards are delayed by 10 time steps; and (c) rewards are perturbed by zero-mean Gaussian noise with a standard deviation equal to 0.1 times the reward range in the replay buffer. The shaded area represents the 95% confidence interval of the mean performance.
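The three imperfect-reward conditions in the Figure 3 caption could be simulated roughly as follows. This is a sketch under stated assumptions, not the authors' experimental code: the names `drop_reward`, `RewardDelay`, and `noisy_reward` are hypothetical, and emitting zero while the delay queue fills is an assumption the caption does not specify.

```python
from collections import deque
import numpy as np

def drop_reward(reward, rng, keep_prob=0.5):
    # (a) Pass the true reward with probability keep_prob, otherwise return zero.
    return reward if rng.random() < keep_prob else 0.0

class RewardDelay:
    # (b) Release each reward `delay` time steps after it was earned.
    # Assumption: zero is emitted until the first delayed reward is available.
    def __init__(self, delay=10):
        self.queue = deque([0.0] * delay)

    def step(self, reward):
        self.queue.append(reward)
        return self.queue.popleft()

def noisy_reward(reward, rng, buffer_rewards, scale=0.1):
    # (c) Add zero-mean Gaussian noise whose standard deviation is
    # scale * (max - min) of the rewards currently held in the replay buffer.
    reward_range = float(np.max(buffer_rewards) - np.min(buffer_rewards))
    return reward + rng.normal(0.0, scale * reward_range)

rng = np.random.default_rng(0)
delay = RewardDelay(delay=10)
delayed = [delay.step(float(t)) for t in range(1, 16)]
print("delayed rewards:", delayed)
print("dropped reward: ", drop_reward(2.0, rng))
print("noisy reward:   ", noisy_reward(2.0, rng, buffer_rewards=[-1.0, 3.0]))
```

Each transform acts only on the scalar reward, so in principle it can be applied just before a transition is written to the replay buffer without touching the rest of the training loop.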

Theorems & Definitions (3)

Proposition 1 [nesterov_random_grad_free]
Theorem 1: Compatible Policy Gradient
Theorem 2: Compatible Function Approximation
