Off-policy Distributional Q($λ$): Distributional RL without Importance Sampling

Yunhao Tang, Mark Rowland, Rémi Munos, Bernardo Ávila Pires, Will Dabney

TL;DR

This work introduces off-policy distributional Q($\lambda$), a multi-step distributional RL operator that learns the target return distribution without importance sampling. It shows that the operator has the fixed point $\eta^\pi$ and is contractive when the data-collection policy $\mu$ is close to the target policy $\pi$, with a contraction radius depending on $\gamma$, $\lambda$, and policy mismatch, and that intermediate iterates are signed measures requiring $\mathscr{M}_1(\mathbb{R})$ representations. The paper develops a practical learning approach via a categorical (C51) representation, using back-up targets formed from $\mathcal{A}_\lambda^{\pi,\mu}$ and a projection, and introduces a trust-region-inspired target policy mixing to improve stability in deep RL. Empirically, it demonstrates faster contraction and competitive performance on tabular tasks and Atari-57, with mixing coefficients around $0.6$–$0.8$ often yielding best results. Overall, the method broadens distributional RL by enabling off-policy learning without IS, highlighting the role of signed measures in multi-step updates and offering practical guidance for stable deep RL deployments.
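
The target policy mixing mentioned above can be illustrated with a short sketch. It assumes the mixed target takes the convex-combination form $\alpha\pi + (1-\alpha)\mu$; the exact rule and the direction of $\alpha$ are assumptions here, not quoted from the paper, and the helper name `mix_target_policy` is hypothetical. Under that assumption, the $L_1$ mismatch to $\mu$ shrinks by exactly a factor of $\alpha$, which is one way to read the trust-region interpretation.

```python
import numpy as np

def mix_target_policy(pi, mu, alpha):
    """Return alpha * pi + (1 - alpha) * mu (assumed form of the mixing rule)."""
    pi = np.asarray(pi, dtype=float)
    mu = np.asarray(mu, dtype=float)
    return alpha * pi + (1.0 - alpha) * mu

# Hypothetical 4-action example: greedy target policy, uniform behavior policy.
pi = np.array([1.0, 0.0, 0.0, 0.0])
mu = np.full(4, 0.25)
alpha = 0.6

pi_mixed = mix_target_policy(pi, mu, alpha)

# Policy mismatch before and after mixing, measured in L1 distance.
l1_before = np.abs(pi - mu).sum()
l1_after = np.abs(pi_mixed - mu).sum()
print(pi_mixed, l1_before, l1_after)
```

A smaller $\alpha$ keeps the evaluated policy closer to the data-collection policy, at the price of evaluating a policy that is no longer the original target.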

Abstract

We introduce off-policy distributional Q($λ$), a new addition to the family of off-policy distributional evaluation algorithms. Off-policy distributional Q($λ$) does not apply importance sampling for off-policy learning, which introduces intriguing interactions with signed measures. Such unique properties distinguish distributional Q($λ$) from other existing alternatives such as distributional Retrace. We characterize the algorithmic properties of distributional Q($λ$) and validate theoretical insights with tabular experiments. We show how distributional Q($λ$)-C51, a combination of Q($λ$) with the C51 agent, exhibits promising results on deep RL benchmarks.
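
As a rough illustration of learning off-policy without importance sampling, the sketch below implements the classical expected-value Q($\lambda$) operator, $Q \mapsto Q + (I - \lambda\gamma P^\mu)^{-1}(T^\pi Q - Q)$, on a random tabular MDP. This is the scalar analogue, not the distributional operator studied in the paper. The MDP, the policies, and all names are hypothetical choices for the example. It checks that $Q^\pi$ is a fixed point even though traces follow $\mu$, and that iterating converges when $\mu$ is close to $\pi$.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 3, 0.9
n_sa = n_states * n_actions

# Random tabular MDP (hypothetical): P[x, a] is a distribution over next states.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
r = rng.uniform(size=n_sa)  # reward for each flattened (state, action) pair

def random_policy():
    return rng.dirichlet(np.ones(n_actions), size=n_states)

def sa_transition(policy):
    """(x, a) -> (x', a') transition matrix when a' is drawn from policy(. | x')."""
    return np.einsum("xay,yb->xayb", P, policy).reshape(n_sa, n_sa)

def q_lambda_operator(q, pi, mu, lam):
    """Expected-value Q(lambda): q + (I - lam * gamma * P^mu)^{-1} (T^pi q - q)."""
    td_error = r + gamma * sa_transition(pi) @ q - q
    return q + np.linalg.solve(np.eye(n_sa) - lam * gamma * sa_transition(mu), td_error)

pi = random_policy()
mu = 0.95 * pi + 0.05 * random_policy()  # behavior policy close to the target
lam = 0.5

# Ground-truth Q^pi from the Bellman equation q = r + gamma * P^pi q.
q_pi = np.linalg.solve(np.eye(n_sa) - gamma * sa_transition(pi), r)

q = np.zeros(n_sa)
for _ in range(200):
    q = q_lambda_operator(q, pi, mu, lam)

print(np.max(np.abs(q - q_pi)))
```

No importance ratio appears anywhere: the off-policy correction comes entirely from the TD error being computed under $\pi$ while the trace is accumulated under $\mu$.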

Paper Structure (37 sections, 14 theorems, 54 equations, 5 figures, 1 table, 1 algorithm)

This paper contains 37 sections, 14 theorems, 54 equations, 5 figures, 1 table, 1 algorithm.

Key Result

Lemma 0

(Closedness of the space of signed measures) Given any $\eta\in\mathscr{M}_1(\mathbb{R})^{\mathcal{X}\times\mathcal{A}}$, we have $\mathcal{A}_\lambda^{\pi,\mu}\eta\in\mathscr{M}_1(\mathbb{R})^{\mathcal{X}\times\mathcal{A}}$.
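
A short sketch of why unit mass is preserved, assuming the operator has the additive form "current estimate plus $\lambda$-weighted distributional TD errors". The form below, and the symbols $\Delta_t$ and $\nu_t^{\pm}$, are assumptions introduced for illustration, not quoted from the paper.

```latex
% Assumed additive form of the operator (illustrative, for \lambda < 1):
\mathcal{A}_\lambda^{\pi,\mu}\eta(x,a)
  = \eta(x,a)
  + \mathbb{E}_\mu\Big[\sum_{t\ge 0} \lambda^t \Delta_t \,\Big|\, X_0 = x,\ A_0 = a\Big],
\qquad
\Delta_t = \nu_t^{+} - \nu_t^{-}.

% \nu_t^{+}: shifted-and-scaled copy of \eta(X_{t+1}, \cdot) averaged under \pi
% \nu_t^{-}: shifted-and-scaled copy of \eta(X_t, A_t)
% Shifting and scaling (pushforward) preserves total mass, so
\nu_t^{\pm}(\mathbb{R}) = 1
\quad\Longrightarrow\quad
\Delta_t(\mathbb{R}) = 0.

% Hence the total mass of the output equals that of the input:
\big(\mathcal{A}_\lambda^{\pi,\mu}\eta(x,a)\big)(\mathbb{R})
  = \eta(x,a)(\mathbb{R}) + \sum_{t\ge 0} \lambda^t \cdot 0
  = 1.
```

Under this reading, nothing forces the result to stay non-negative: the subtracted terms $\nu_t^{-}$ are not reweighted by importance ratios, so the output is in general a signed measure with unit mass rather than a probability distribution.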

Figures (5)

  • Figure 1: An illustration of the signed measure properties specific to the off-policy Q($\lambda$) operator. The blue and green bars represent the positive and negative probability masses of unit-mass signed measures. We visualize the iterate $\eta_{k+1}=\mathcal{A}_\lambda^{\pi,\mu}\eta_k$ for a fixed state-action pair over time on a tabular MDP. The iterate starts as a distribution (an element of $\mathcal{P}(\mathbb{R})$), transitions into a signed measure with unit mass (an element of $\mathscr{M}_1(\mathbb{R})$), and eventually converges to the target return distribution $\eta^\pi$, which is itself a distribution. Prior distributional RL policy evaluation operators do not exhibit this behavior, as their iterates are always distributions.
  • Figure 2: Illustration of categorical projection for the signed measure. On the left, we have a signed measure $\eta\in\mathscr{M}_1(\mathbb{R})$; on the right, we show the categorical projection of the signed measure $\eta$ onto the space $\mathscr{M}_{1,c}(\mathbb{R})$, with green bars showing the negative mass of the projected measure. The categorical projection is a discretized approximation to the original signed measure, with increasing accuracy as the number of atoms $(z_i)_{i=1}^m$ increases.
  • Figure 3: The distance between the algorithmic iterate $\eta_k$ and the return distribution for the optimal policy $\eta^\ast$, as we run control algorithms with distributional one-step, Retrace and off-policy Q($\lambda$). All algorithms use categorical representations and set the greedy policy as the target policy. Each curve shows an algorithmic variant with a different hyper-parameter setting ($\bar{c}$ for Retrace and $\lambda$ for Q($\lambda$)). Note that Q($\lambda$) can obtain better performance than Retrace when $\lambda$ is chosen properly; when $\lambda$ is too large ($\geq 0.7$ in this case), the algorithm diverges: despite the initial fast decay in the distance, it does not converge to the correct fixed point.
  • Figure 4: Comparison of C51 [bellemare2017distributional], Retrace-C51 [tang2022nature] and off-policy distributional Q($\lambda$) with target mixing $\alpha=0.6$ based on Eqn. \ref{eq:mixing-target} and $\lambda=0.4$. We show the agents' average performance metrics evaluated throughout training: the inter-quartile mean score [agarwal2021deep], which can be understood as a more robust alternative to the mean score; and the median score, calculated across all 57 games. All scores show the mean and bootstrapped confidence intervals across $5$ seeds [agarwal2021deep]. Off-policy distributional Q($\lambda$) obtains performance improvements over Retrace-C51 when using target mixing.
  • Figure 5: The distance between the algorithmic iterate $\eta_k$ and the return distribution for the optimal policy $\eta^\ast$, as we run control algorithms with distributional one-step, Retrace and off-policy Q($\lambda$). All algorithms use categorical representations and set the greedy policy as the target policy. Each curve shows an algorithmic variant with a different hyper-parameter setting ($\bar{c}$ for Retrace and $\lambda$ for Q($\lambda$)). Unlike Figure 3 with $|\mathcal{A}|=20$, here with $|\mathcal{A}|=5$ the behavior of all algorithms changes slightly. Since the problem effectively becomes less off-policy, Retrace can benefit from the full trace with $\bar{c}=4$, outperforming Q($\lambda$); meanwhile, Q($\lambda$) becomes more stable across all $\lambda$ values.
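
The categorical projection described in the Figure 2 caption can be sketched as follows. This is a minimal version assuming the standard C51-style projection (linear interpolation of each particle onto its two neighboring atoms, with clipping at the support boundary) applied unchanged to signed masses. The function name and the example measure are hypothetical.

```python
import numpy as np

def categorical_projection(locations, masses, atoms):
    """Project sum_j masses[j] * delta(locations[j]) onto the sorted grid `atoms`.

    Masses may be negative; the map is linear in the masses, so total mass is kept.
    """
    atoms = np.asarray(atoms, dtype=float)
    masses = np.asarray(masses, dtype=float)
    # Particles outside the grid are moved to the nearest boundary atom.
    locations = np.clip(np.asarray(locations, dtype=float), atoms[0], atoms[-1])

    # Index of the atom at or just below each particle, capped so lower + 1 exists.
    lower = np.clip(np.searchsorted(atoms, locations, side="right") - 1, 0, len(atoms) - 2)
    upper = lower + 1
    weight_upper = (locations - atoms[lower]) / (atoms[upper] - atoms[lower])

    projected = np.zeros(len(atoms))
    # np.add.at accumulates correctly when several particles hit the same atom.
    np.add.at(projected, lower, masses * (1.0 - weight_upper))
    np.add.at(projected, upper, masses * weight_upper)
    return projected

# Hypothetical unit-mass signed measure with one negative particle.
atoms = np.linspace(0.0, 4.0, 5)
locations = np.array([0.5, 2.0, 3.25])
masses = np.array([0.7, -0.2, 0.5])

projected = categorical_projection(locations, masses, atoms)
print(projected, projected.sum())
```

Because each particle's mass is split between two atoms with weights that sum to one, the projected measure keeps unit total mass, and negative particles stay negative after projection.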
