Off-policy Distributional Q($λ$): Distributional RL without Importance Sampling

Yunhao Tang; Mark Rowland; Rémi Munos; Bernardo Ávila Pires; Will Dabney

TL;DR

This work introduces off-policy distributional Q($\lambda$), a multi-step distributional RL operator that learns the target return distribution without importance sampling. It shows that the operator has the fixed point $\eta^\pi$ and is contractive when the data-collection policy $\mu$ is close to the target policy $\pi$, with a contraction radius depending on $\gamma$, $\lambda$, and policy mismatch, and that intermediate iterates are signed measures requiring $\mathscr{M}_1(\mathbb{R})$ representations. The paper develops a practical learning approach via a categorical (C51) representation, using back-up targets formed from $\mathcal{A}_\lambda^{\pi,\mu}$ and a projection, and introduces a trust-region-inspired target policy mixing to improve stability in deep RL. Empirically, it demonstrates faster contraction and competitive performance on tabular tasks and Atari-57, with mixing coefficients around $0.6$–$0.8$ often yielding best results. Overall, the method broadens distributional RL by enabling off-policy learning without IS, highlighting the role of signed measures in multi-step updates and offering practical guidance for stable deep RL deployments.

Abstract

We introduce off-policy distributional Q($λ$), a new addition to the family of off-policy distributional evaluation algorithms. Off-policy distributional Q($λ$) does not apply importance sampling for off-policy learning, which introduces intriguing interactions with signed measures. Such unique properties distinguish distributional Q($λ$) from other existing alternatives such as distributional Retrace. We characterize the algorithmic properties of distributional Q($λ$) and validate theoretical insights with tabular experiments. We show how distributional Q($λ$)-C51, a combination of Q($λ$) with the C51 agent, exhibits promising results on deep RL benchmarks.

Paper Structure (37 sections, 14 theorems, 54 equations, 5 figures, 1 table, 1 algorithm)

Introduction
Contraction and fixed point.
Signed measures and representations.
Trust region interpretation and deep RL.
Background
Multi-step value-based learning and off-policy Q($\lambda$)
Distributional reinforcement learning
Multi-step distributional RL
Off-policy distributional Q($\lambda$)
Off-policy dist. Q($\lambda$) targets are signed measures
Fixed point and contraction property
Alternative way to construct distributional Q($\lambda$).
Learning with categorical representation
Brief background.
Fixed point and contraction property
...and 22 more sections

Key Result

Lemma 0

(Closedness of the space of signed measures) Given any $\eta\in\mathscr{M}_1(\mathbb{R})^{\mathcal{X}\times\mathcal{A}}$, we have $\mathcal{A}_\lambda^{\pi,\mu}\eta\in\mathscr{M}_1(\mathbb{R})^{\mathcal{X}\times\mathcal{A}}$.
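
To make the lemma concrete, the following is a minimal numerical sketch of how a unit-mass signed measure can arise from an update without importance sampling. It is not the operator $\mathcal{A}_\lambda^{\pi,\mu}$ itself: the update pattern (a distribution plus a weighted difference of two distributions, mirroring the correction term of value-based Q($\lambda$)) is an assumption made for illustration, and the names `eta`, `target`, `baseline`, and `weight` are hypothetical.

```python
import numpy as np

# Three probability distributions on a shared support of 3 atoms.
eta = np.array([0.2, 0.5, 0.3])       # current iterate, a distribution
target = np.array([0.0, 0.1, 0.9])    # hypothetical back-up target, a distribution
baseline = np.array([0.6, 0.3, 0.1])  # hypothetical term that gets subtracted

# Hypothetical trace weight (plays the role of a lambda-dependent coefficient).
weight = 0.8

# The difference (target - baseline) has total mass zero, so adding any
# multiple of it to eta keeps the total mass at one, but individual
# masses may become negative.
updated = eta + weight * (target - baseline)

print(updated)        # some entries may be negative
print(updated.sum())  # total mass stays at one
```

In this sketch `updated` is no longer a probability distribution because one of its masses is negative, yet its total mass is still one, which is the property the lemma states for elements of $\mathscr{M}_1(\mathbb{R})$.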

Figures (5)

Figure 1: An illustration of the signed measure properties specific to the off-policy Q($\lambda$) operator. The blue and green bars represent the positive and negative probability masses of unit-mass signed measures. We visualize the iterate $\eta_{k+1}=\mathcal{A}_\lambda^{\pi,\mu}\eta_k$ for a fixed state-action pair over time on a tabular MDP. The iterate starts as a distribution (an element of $\mathcal{P}(\mathbb{R})$), transitions into a signed measure with unit mass (an element of $\mathscr{M}_1(\mathbb{R})$), and eventually converges to the target return distribution $\eta^\pi$, which is itself a distribution. Prior distributional RL policy evaluation operators do not exhibit this behavior, as their iterates are always distributions.
Figure 2: Illustration of categorical projection for the signed measure. On the left, we have a signed measure $\eta\in\mathscr{M}_1(\mathbb{R})$; on the right, we show the categorical projection of the signed measure $\eta$ onto the space $\mathscr{M}_{1,c}(\mathbb{R})$, with green bars showing the negative mass of the projected measure. The categorical projection is a discretized approximation to the original signed measure, with increasing accuracy as the number of atoms $(z_i)_{i=1}^m$ increases.
Figure 3: The distance between the algorithmic iterate $\eta_k$ and the return distribution $\eta^\ast$ of the optimal policy, as we run control algorithms with distributional one-step, Retrace and off-policy Q($\lambda$). All algorithms use categorical representations and use the greedy policy as the target policy. Each curve shows an algorithmic variant with a different hyper-parameter setting ($\bar{c}$ for Retrace and $\lambda$ for Q($\lambda$)). Note that Q($\lambda$) can obtain better performance than Retrace when $\lambda$ is chosen properly; when $\lambda$ is too large ($\geq 0.7$ in this case), the algorithm diverges: despite the initial fast decay in the distance, it does not converge to the correct fixed point.
Figure 4: Comparison of C51 (Bellemare et al., 2017), Retrace-C51 (Tang et al., 2022) and off-policy distributional Q($\lambda$) with target mixing $\alpha=0.6$ (based on the target-mixing equation) and $\lambda=0.4$. We show the agents' average performance metrics evaluated throughout training: the inter-quartile mean score (Agarwal et al., 2021), which can be understood as a more robust alternative to the mean score; and the median score, calculated across all 57 games. All scores show the mean and bootstrapped confidence intervals across $5$ seeds (Agarwal et al., 2021). Off-policy distributional Q($\lambda$) obtains performance improvements over Retrace-C51 when using target mixing.
Figure 5: The distance between the algorithmic iterate $\eta_k$ and the return distribution $\eta^\ast$ of the optimal policy, as we run control algorithms with distributional one-step, Retrace and off-policy Q($\lambda$). All algorithms use categorical representations and use the greedy policy as the target policy. Each curve shows an algorithmic variant with a different hyper-parameter setting ($\bar{c}$ for Retrace and $\lambda$ for Q($\lambda$)). Unlike Figure 3 with $|\mathcal{A}|=20$, here with $|\mathcal{A}|=5$ the behavior of all algorithms changes slightly. Since the problem effectively becomes less off-policy, Retrace can benefit from the full trace with $\bar{c}=4$, outperforming Q($\lambda$); meanwhile, Q($\lambda$) becomes more stable across all $\lambda$ values.
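
The categorical projection described in the Figure 2 caption can be sketched as follows. This is a minimal illustration, assuming the projection is the usual C51-style linear interpolation onto neighbouring atoms applied mass by mass, so that it extends to negative masses unchanged. The function name `categorical_projection` and the example numbers are hypothetical and not taken from the paper.

```python
import numpy as np

def categorical_projection(locations, masses, atoms):
    """Project a finitely supported signed measure onto a sorted grid of atoms.

    Each point mass (possibly negative) is split between its two
    neighbouring atoms by linear interpolation. Points outside the grid
    are clipped to the nearest end atom. Total mass is preserved.
    """
    locations = np.asarray(locations, dtype=float)
    masses = np.asarray(masses, dtype=float)
    atoms = np.asarray(atoms, dtype=float)

    clipped = np.clip(locations, atoms[0], atoms[-1])
    # Index of the right-hand neighbour, kept in [1, len(atoms) - 1].
    upper = np.clip(np.searchsorted(atoms, clipped, side="left"), 1, len(atoms) - 1)
    lower = upper - 1

    # Fraction of each mass assigned to the right-hand neighbour.
    w_upper = (clipped - atoms[lower]) / (atoms[upper] - atoms[lower])
    w_lower = 1.0 - w_upper

    projected = np.zeros(len(atoms))
    # np.add.at accumulates correctly when several points share an atom.
    np.add.at(projected, lower, masses * w_lower)
    np.add.at(projected, upper, masses * w_upper)
    return projected

# A unit-mass signed measure with one negative mass.
atoms = np.linspace(0.0, 4.0, 5)   # grid z_1, ..., z_5 = 0, 1, 2, 3, 4
locations = [0.5, 2.25, 3.0]
masses = [0.7, -0.2, 0.5]          # sums to 1

projected = categorical_projection(locations, masses, atoms)
print(projected)
print(projected.sum())
```

Because the interpolation weights for each point sum to one, the projected measure keeps unit total mass while still carrying negative mass on some atoms, matching the green bars in the figure. A finer grid of atoms leaves less interpolation error, in line with the caption's remark on accuracy.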

Theorems & Definitions (22)

Lemma 0
Lemma 0
Lemma 0
Corollary 0
Lemma 0
Lemma 0
Lemma 0
proof
Lemma 0
proof
...and 12 more
