When Do Off-Policy and On-Policy Policy Gradient Methods Align?

Davide Mambelli; Stephan Bongers; Onno Zoeter; Matthijs T. J. Spaan; Frans A. Oliehoek

TL;DR

This work analyzes when off-policy policy gradient methods that optimize the excursion objective align with the true on-policy deployment performance. By linking state visitation, stationary distributions, and discounted rewards, the authors prove that for finite-state irreducible and aperiodic Markov chains, the excursion and on-policy objectives converge as the discount factor $\gamma$ approaches 1, and they provide explicit bounds on the gradient mismatch. Theoretical results are complemented by empirical validation on a simple two-state MDP and offline policy ranking experiments in DeepMind Control Suite, demonstrating that larger $\gamma$ improves alignment but that misalignment can persist in some continuous-control environments. The findings offer practical guidance for using excursion-based off-policy methods, highlighting when they reliably reflect deployment performance and how to trade off alignment with convergence speed in policy optimization.

Abstract

Policy gradient methods are widely adopted reinforcement learning algorithms for tasks with continuous action spaces. These methods have succeeded in many application domains; however, because of their notorious sample inefficiency, their use remains limited to problems where fast and accurate simulations are available. A common way to improve sample efficiency is to modify their objective function so that it can be computed from off-policy samples without importance sampling. A well-established off-policy objective is the excursion objective. This work studies the difference between the excursion objective and the traditional on-policy objective, which we refer to as the on-off gap. We provide the first theoretical analysis showing conditions under which the on-off gap is reduced, and we establish empirical evidence of the shortfalls that arise when these conditions are not met.
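
For orientation, the two objectives being compared can be written side by side. This is only a sketch in the notation of the figure captions ($J_\nu$, $\mu$, $d_b$); reading $d_b$ as the state distribution induced by the behavior policy is an assumption, since this summary does not define it.

$$
J_\nu(\pi)=\mathbb{E}_{s\sim\nu}\left[V_\pi(s)\right],\qquad \text{on-policy: } \nu=\mu,\qquad \text{excursion: } \nu=d_b
$$

$$
\text{scaled on-off gap (value)}=(1-\gamma)\,\left\lvert J_\mu(\pi)-J_{d_b}(\pi)\right\rvert
$$

Under this reading, the two objectives average the same value function $V_\pi$ and differ only in the state weighting $\nu$.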

Paper Structure (26 sections, 12 theorems, 53 equations, 7 figures)

Introduction
Background
Discounted reward
Policy Gradient Methods
Problem Formulation and Related Works
State Visitation and Stationarity
Average Reward
Discounted Reward
When Do the Objectives Coincide?
Summary
Experiments
Empirical Validation
Practical Implications
Discussion
Conclusions
...and 11 more sections

Key Result

Lemma 4.3

For an MDP with finite states and starting distribution $\mu\in\Delta(\mathcal{S})$, it holds that

Figures (7)

Figure 1: Stylized visualization of the on-off gap vs. $\gamma$.
Figure 2: Considering a two-state MDP, we draw $V_\pi(s)$ for each state as a function of the policy $\pi\in\Pi$. Choosing a distribution $\nu\in\Delta(\mathcal{S})$ for the objective $J_{\nu}(\pi)=\mathbb{E}_{s\sim \nu}[V_\pi(s)]$ corresponds to selecting a mixture of the state values. We refer to the policy that simultaneously maximizes the value at all states as $\pi^*$. When $\pi^*$ is within the function class, for any distribution $\nu$, we have $\pi^* = \pi^*_{\nu}$, where $\pi^*_{\nu}=\text{argmax}_{\pi\in\Pi}J_{\nu}(\pi)$. Otherwise, different initial state distributions might lead to different optima. For instance, in the example above, by selecting two distributions $\nu_0(s_0)=1$ and $\nu_1(s_1)=1$ and restricting the argmax to the function class $\overline{\Pi}\subset\Pi$ (represented by the green line), we obtain two distinct policies that maximize their respective objective functions, $\pi^*_{\nu_0}\neq\pi^*_{\nu_1}$.
Figure 3: On-off gap as mean and 95% confidence interval: (a) Dependence of value error, computed as $(1-\gamma)\lvert J_\mu(\pi)-J_{d_b}(\pi)\rvert$, on $\gamma$. (b) Dependence of gradient error, computed as $(1-\gamma)\lVert \nabla J_\mu(\pi)-\nabla J_{d_b}(\pi)\rVert$, on $\gamma$.
Figure 4: Performance in offline policy selection (with 95% confidence intervals) across 3 MuJoCo environments as a function of $\gamma$ when ranking policies with the excursion objective or randomly; a score of 1 means the ranking matches the ground truth.
Figure 5: The environment executes the action the agent selects with probability $q$. Therefore, $\Pr(s \mid \text{"stay"},\,s)=\Pr(\neg s \mid \text{"move"},\,s)=q$. The reward function is $r(s_1,\,-)=1$ and $r(s_0,\,-)=0$.
...and 2 more figures
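
The value error from Figure 3(a) can be reproduced numerically on the two-state MDP of Figure 5. The following is a minimal sketch, not the authors' code. It assumes that when the selected action is not executed (probability $1-q$) the other action is executed instead, and that $d_b$ is the stationary distribution of the chain induced by the behavior policy. The function names, the policies (`pi_stay`, `b_stay`), the start distribution and $q=0.9$ are all hypothetical choices for illustration.

```python
import numpy as np

def policy_transition_matrix(pi_stay, q):
    """State-to-state transition matrix of the two-state MDP under a policy.

    pi_stay[s] is the probability of choosing "stay" in state s.
    Assumption: with probability 1 - q the *other* action is executed.
    """
    pi_stay = np.asarray(pi_stay, dtype=float)
    # Probability of remaining in the same state:
    # chose "stay" and it was executed, or chose "move" and it was flipped.
    p_same = pi_stay * q + (1.0 - pi_stay) * (1.0 - q)
    return np.array([[p_same[0], 1.0 - p_same[0]],
                     [1.0 - p_same[1], p_same[1]]])

def state_values(P, r, gamma):
    """Discounted values V = (I - gamma * P)^{-1} r."""
    return np.linalg.solve(np.eye(len(r)) - gamma * P, r)

def stationary_distribution(P):
    """Solve d P = d with sum(d) = 1 as a least-squares system."""
    n = P.shape[0]
    A = np.vstack([P.T - np.eye(n), np.ones(n)])
    b = np.append(np.zeros(n), 1.0)
    d, *_ = np.linalg.lstsq(A, b, rcond=None)
    return d

def scaled_value_gap(pi_stay, b_stay, mu, q, gamma):
    """(1 - gamma) * |J_mu(pi) - J_{d_b}(pi)| as in the Figure 3(a) caption."""
    r = np.array([0.0, 1.0])  # r(s_0, -) = 0, r(s_1, -) = 1
    V = state_values(policy_transition_matrix(pi_stay, q), r, gamma)
    d_b = stationary_distribution(policy_transition_matrix(b_stay, q))
    return (1.0 - gamma) * abs(mu @ V - d_b @ V)

# Hypothetical setup: target policy, uniform behavior policy, start in s_0.
gammas = [0.5, 0.9, 0.99, 0.999]
gaps = [scaled_value_gap([0.3, 0.9], [0.5, 0.5], np.array([1.0, 0.0]), 0.9, g)
        for g in gammas]
for g, gap in zip(gammas, gaps):
    print(f"gamma={g}: scaled value gap={gap:.5f}")
```

In this particular setup the scaled gap shrinks as $\gamma$ grows, in line with the trend the TL;DR describes. Other policies or start distributions change the size of the gap.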

Theorems & Definitions (21)

Remark 4.1
Definition 4.2
Lemma 4.3
Corollary 4.4
Theorem 5.1
Theorem 5.2
Theorem 5.3
Corollary 5.4
Definition
Theorem
...and 11 more
