Off-OAB: Off-Policy Policy Gradient Method with Optimal Action-Dependent Baseline

Wenjia Meng; Qian Zheng; Long Yang; Yilong Yin; Gang Pan

TL;DR

This work tackles the high variance of off-policy policy gradient (OPPG) estimators by introducing Off-OAB, an unbiased action-dependent baseline that minimizes OPPG variance. It develops the optimal per-dimension baseline under a diagonal Gaussian policy, and demonstrates that this action-aware baseline can reduce variance more effectively than the best state-dependent baseline. To keep the method practical, the authors propose a computationally efficient approximation $b_i(s,a^{-i}) \approx \mathbb{E}_{a^i\sim\mu}[Q_ta(s,a)]$, leading to the Off-OAB algorithm that integrates this baseline into the OPPG update via a replay-buffered critic. Extensive experiments on six continuous-control tasks from OpenAI Gym and MuJoCo show that Off-OAB delivers improved sample efficiency and higher returns than several state-of-the-art methods, validating the variance-reduction benefits of action-dependent baselines in off-policy learning.
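The approximation above, an expectation of the critic over a single action dimension drawn from the behavior policy, can be illustrated on a toy problem. The sketch below is not the authors' implementation: it assumes one fixed state, a two-dimensional diagonal Gaussian target policy $\pi$ and behavior policy $\mu$, and a hypothetical closed-form `q_value` standing in for a learned critic. The names `estimate`, `q_value`, `k_inner`, and all numeric settings are illustrative assumptions. It estimates the baseline by Monte Carlo and compares the importance-weighted gradient estimate with and without it.

```python
import math
import random


def normal_pdf(x, mean, std):
    """Density of a univariate Gaussian."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def q_value(a, offset=20.0, target=(1.0, -0.5)):
    """Hypothetical stand-in for a critic Q(s, a) at one fixed state."""
    return offset - sum((x - t) ** 2 for x, t in zip(a, target))


def estimate(n_samples=20000, k_inner=8, use_baseline=True, seed=0):
    """Importance-weighted gradient w.r.t. the policy mean, per dimension.

    Returns (means, variances) of the per-sample gradient terms.
    """
    rng = random.Random(seed)
    theta, sigma = (0.0, 0.0), 1.0   # target policy pi = N(theta, sigma^2 I)
    m, s = (0.2, -0.1), 1.2          # behavior policy mu = N(m, s^2 I)
    dim = len(theta)
    sums = [0.0] * dim
    sq_sums = [0.0] * dim
    for _sample in range(n_samples):
        # Off-policy data: the action is drawn from mu, not pi.
        a = [rng.gauss(m[i], s) for i in range(dim)]
        rho = 1.0
        for i in range(dim):
            rho *= normal_pdf(a[i], theta[i], sigma) / normal_pdf(a[i], m[i], s)
        q = q_value(a)
        for i in range(dim):
            b = 0.0
            if use_baseline:
                # b_i(s, a^{-i}) ~ E_{a^i ~ mu}[Q(s, a)]: resample only
                # dimension i and keep the other dimensions fixed.
                for _k in range(k_inner):
                    a_alt = list(a)
                    a_alt[i] = rng.gauss(m[i], s)
                    b += q_value(a_alt)
                b /= k_inner
            score = (a[i] - theta[i]) / sigma ** 2  # d/d theta_i of log pi
            g = rho * score * (q - b)
            sums[i] += g
            sq_sums[i] += g * g
    means = [x / n_samples for x in sums]
    variances = [sq / n_samples - mean ** 2 for sq, mean in zip(sq_sums, means)]
    return means, variances


if __name__ == "__main__":
    for flag in (False, True):
        means, variances = estimate(use_baseline=flag)
        print("baseline" if flag else "no baseline", means, variances)
```

For this toy critic the exact gradient with respect to the policy mean is $(2, -1)$, so both estimates should land near that value, with a markedly smaller per-sample variance when the baseline is used. The large constant offset in `q_value` is chosen deliberately to make the effect easy to see.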

Abstract

Policy-based methods have achieved remarkable success in solving challenging reinforcement learning problems. Among these methods, off-policy policy gradient methods are particularly important because they can benefit from off-policy data. However, these methods suffer from the high variance of the off-policy policy gradient (OPPG) estimator, which results in poor sample efficiency during training. In this paper, we propose an off-policy policy gradient method with the optimal action-dependent baseline (Off-OAB) to mitigate this variance issue. Specifically, this baseline preserves the unbiasedness of the OPPG estimator while theoretically minimizing its variance. To improve computational efficiency in practice, we design an approximated version of this optimal baseline. Using this approximation, our method (Off-OAB) aims to reduce the variance of the OPPG estimator during policy optimization. We evaluate the proposed Off-OAB method on six representative tasks from OpenAI Gym and MuJoCo, where it surpasses state-of-the-art methods on the majority of these tasks.

Paper Structure (24 sections, 4 theorems, 51 equations, 3 figures, 3 tables, 2 algorithms)

Introduction
Preliminaries & Background
Action-Dependent Baseline for Off-Policy Policy Gradient (OPPG) Estimator
Unbiased Off-Policy Action-Dependent Baseline
Optimal Off-Policy Action-Dependent Baseline
Variance Reduction Compared to the Optimal State-Dependent Baseline
Proposed Off-OAB
Approximated Optimal Off-Policy Baseline
Detailed Algorithm Implementation for Off-OAB
Experiments
Setup
Comparison with the State-of-the-art Methods
Comparison with Other Baselines
Study on Sample Efficiency
Study on Variance Reduction
...and 9 more sections

Key Result

Theorem 1

Let $\mathcal{g}_{\text{off}}(b)$ be the unbiased off-policy policy gradient estimator with baseline $b$. The optimal off-policy action-dependent baseline, i.e. the one that minimizes the variance of $\mathcal{g}_{\text{off}}(b)$, is expressed in terms of the importance ratio $\rho(s, a)=\frac{\pi(a|s)}{\mu(a|s)}$ between the target policy $\pi$ and the behavior policy $\mu$.
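
The reason a baseline that ignores its own action dimension can leave the estimator unbiased (Remark 1) can be sketched in a few lines. The derivation below rests on assumptions not spelled out in this summary: the policy factorizes across action dimensions, $\pi(a|s)=\prod_i \pi(a^i|s)$ (as a diagonal Gaussian does), the support of $\mu$ covers that of $\pi$, and the baseline enters dimension $i$ through a term of the form $\rho(s,a)\,\nabla_\theta \log \pi(a^i|s)\, b_i(s,a^{-i})$. The paper's exact estimator may differ in detail.

```latex
\begin{align*}
\mathbb{E}_{a\sim\mu}\!\left[\rho(s,a)\,\nabla_\theta \log \pi(a^i|s)\, b_i(s,a^{-i})\right]
&= \mathbb{E}_{a\sim\pi}\!\left[\nabla_\theta \log \pi(a^i|s)\, b_i(s,a^{-i})\right] \\
&= \mathbb{E}_{a^{-i}\sim\pi}\!\left[ b_i(s,a^{-i})\,
   \mathbb{E}_{a^i\sim\pi}\!\left[\nabla_\theta \log \pi(a^i|s)\right]\right] \\
&= \mathbb{E}_{a^{-i}\sim\pi}\!\left[ b_i(s,a^{-i})\,
   \nabla_\theta \!\int \pi(a^i|s)\,\mathrm{d}a^i \right] = 0 .
\end{align*}
```

The first step uses the importance ratio to change the sampling distribution from $\mu$ to $\pi$; the last step uses the fact that $\pi(a^i|s)$ integrates to one, so its gradient vanishes. Under these assumptions, any choice of $b_i(s,a^{-i})$ leaves the mean of the estimator unchanged, and the baseline can be chosen purely to reduce variance.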

Figures (3)

Figure 1: Results of the proposed Off-OAB method and other state-of-the-art deep reinforcement learning methods (ACER, IMPALA, IPG, SAC, TD3, PPO, SLAC, and PGAFB) on representative tasks. The shaded region denotes the standard deviation over five seeded runs. The $X$-axis and $Y$-axis denote environment timesteps and average return, respectively.
Figure 2: Results of our method with varying baselines. The shaded region denotes the standard deviation over five seeded runs. The $X$-axis and $Y$-axis denote environment timesteps and average return, respectively.
Figure 3: Results of our method with different baselines (without baseline, state-dependent baseline, action-dependent baseline) on Hopper, Walker2d, and Ant. The shaded region indicates the standard deviation over five random seeds. The $X$-axis denotes the environment timesteps. The $Y$-axis denotes the logarithm of the gradient variance.

Theorems & Definitions (14)

Remark 1: Unbiasedness of Action-dependent Baseline
proof
Theorem 1: Optimal Off-Policy Action-Dependent Baseline
proof
Theorem 2: Variance Difference between Optimal State and Action-Dependent Baseline
proof
Theorem 3: Close to Optimal Action-Dependent Baseline
proof
proof
proof
...and 4 more
