Generalized Policy Improvement Algorithms with Theoretically Supported Sample Reuse

James Queeney, Ioannis Ch. Paschalidis, Christos G. Cassandras

TL;DR

The paper develops Generalized Policy Improvement (GPI) algorithms that merge on-policy policy-improvement guarantees with principled sample reuse from recent policies, addressing the trade-off between performance guarantees and data efficiency in deep RL. By introducing a generalized trust-region framework and an optimal mixture over past policies, the authors derive GePPO, GeTRPO, and GeVMPO, which reuse data while preserving approximate policy improvement. Theoretical results show how to set the GPI trust-region parameter and mixture weights to bound risk and maintain guarantees, and empirical results on the DeepMind Control Suite demonstrate substantial gains over on-policy baselines, especially in sparse-reward tasks. The work offers a practical pathway to more data-efficient, reliable policy learning in real-world control scenarios, with future directions toward extending sample reuse to more aggressive off-policy settings.

Abstract

We develop a new class of model-free deep reinforcement learning algorithms for data-driven, learning-based control. Our Generalized Policy Improvement algorithms combine the policy improvement guarantees of on-policy methods with the efficiency of sample reuse, addressing a trade-off between two important deployment requirements for real-world control: (i) practical performance guarantees and (ii) data efficiency. We demonstrate the benefits of this new class of algorithms through extensive experimental analysis on a broad range of simulated control tasks.

Paper Structure

This paper contains 20 sections, 4 theorems, 33 equations, 4 figures, 1 table, 2 algorithms.

Key Result

Theorem 1

Consider any policy $\pi$ and a current policy $\pi_k$. Then, we have that where $\operatorname{TV} \left( \pi, \pi_k \right)(s) = \frac{1}{2} \int_{\mathcal{A}} \left| \pi(a \mid s) - \pi_k(a \mid s) \right| \textnormal{d}a$ represents the Total Variation (TV) distance between the distributions $\pi(\, \cdot \mid s)$ and $\pi_k(\, \cdot \mid s)$, and $C^{\pi,\pi_k} = \max
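
The displayed inequality did not survive in the statement above. As a reference point, the policy improvement lower bound from Achiam et al. (2017), to which the theorem list attributes Theorem 1, is commonly written in the form below. The symbols $J$ (expected discounted return), $\gamma$ (discount factor), $d^{\pi_k}$ (discounted state visitation distribution under $\pi_k$), $A^{\pi_k}$ (advantage function), and $\mathcal{S}$ (state space) are assumed from standard usage and are not defined in this excerpt, so this is a sketch under those assumptions rather than the paper's exact typesetting.

```latex
J(\pi) - J(\pi_k) \;\geq\;
  \frac{1}{1-\gamma}
  \mathop{\mathbb{E}}_{(s,a) \sim d^{\pi_k}}
  \left[ \frac{\pi(a \mid s)}{\pi_k(a \mid s)} \, A^{\pi_k}(s,a) \right]
  \;-\;
  \frac{2 \gamma \, C^{\pi,\pi_k}}{(1-\gamma)^2}
  \mathop{\mathbb{E}}_{s \sim d^{\pi_k}}
  \left[ \operatorname{TV}\left( \pi, \pi_k \right)(s) \right],
\qquad
C^{\pi,\pi_k} = \max_{s \in \mathcal{S}}
  \left| \mathop{\mathbb{E}}_{a \sim \pi(\cdot \mid s)}
  \left[ A^{\pi_k}(s,a) \right] \right|
```

Read this way, the first term is a surrogate objective that can be estimated from data collected under $\pi_k$, and the second term is a penalty that grows with the expected TV distance between $\pi$ and $\pi_k$. Keeping that distance small is what a trust-region constraint enforces.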

Figures (4)

  • Figure 1: Benefit of GPI update compared to on-policy case. Represents percent increases in effective sample size and total TV distance update size for all possible values of $\kappa \in [0,1]$. Markers indicate $\kappa=0.0,0.5,1.0$. Left: Comparison across several values of $B$. Right: Comparison of non-uniform and uniform weights for large $B$.
  • Figure 2: Generalized vs. on-policy final performance by task. Bars represent final performance of the best performing GPI algorithm and the best performing on-policy algorithm. Excludes 7 tasks where no learning occurs under any algorithm. Sorted from high to low based on on-policy performance.
  • Figure 3: Difference between generalized and on-policy final performance by task. Bars represent difference in final performance between the best performing GPI algorithm and the best performing on-policy algorithm. Excludes 7 tasks where no learning occurs under any algorithm. Sorted from high to low.
  • Figure 4: Average difference between generalized and on-policy final performance for different sparsity levels. Bars represent difference in final performance between the best performing GPI algorithm and the best performing on-policy algorithm, averaged across all tasks with the specified sparsity level. Excludes 7 tasks where no learning occurs under any algorithm. Labels represent improvement relative to the average performance gain across all tasks, denoted by the vertical dotted line.

Theorems & Definitions (10)

  • Theorem 1: From Achiam et al. (2017)
  • Definition 1
  • Theorem 2: From Queeney et al. (2021)
  • Definition 2
  • Theorem 3
  • Proof
  • Definition 3
  • Definition 4
  • Theorem 4
  • Proof