Off-policy Distributional Q($\lambda$): Distributional RL without Importance Sampling
Yunhao Tang, Mark Rowland, Rémi Munos, Bernardo Ávila Pires, Will Dabney
TL;DR
This work introduces off-policy distributional Q($\lambda$), a multi-step distributional RL operator that learns the target return distribution without importance sampling. The operator has the fixed point $\eta^\pi$ and is contractive when the data-collection policy $\mu$ is close to the target policy $\pi$, with a contraction radius that depends on $\gamma$, $\lambda$, and the policy mismatch. Its intermediate iterates are signed measures, so they must be represented in $\mathscr{M}_1(\mathbb{R})$ rather than as probability distributions. The paper develops a practical learning algorithm based on a categorical (C51) representation, with back-up targets formed by applying $\mathcal{A}_\lambda^{\pi,\mu}$ followed by a projection, and introduces a trust-region-inspired target-policy mixing scheme to improve stability in deep RL. Empirically, it demonstrates faster contraction and competitive performance on tabular tasks and Atari-57, with mixing coefficients around $0.6$–$0.8$ often yielding the best results. Overall, the method broadens distributional RL by enabling off-policy learning without importance sampling, highlights the role of signed measures in multi-step updates, and offers practical guidance for stable deep RL training.
Abstract
We introduce off-policy distributional Q($\lambda$), a new addition to the family of off-policy distributional evaluation algorithms. Off-policy distributional Q($\lambda$) does not apply importance sampling for off-policy learning, which introduces intriguing interactions with signed measures. These unique properties distinguish distributional Q($\lambda$) from existing alternatives such as distributional Retrace. We characterize the algorithmic properties of distributional Q($\lambda$) and validate the theoretical insights with tabular experiments. We show that distributional Q($\lambda$)-C51, a combination of Q($\lambda$) with the C51 agent, exhibits promising results on deep RL benchmarks.
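
To make the interaction between signed measures and a categorical representation concrete, the following is a minimal sketch of the standard C51-style projection onto a fixed, evenly spaced support. It is not taken from the paper: the function name `categorical_projection` and the toy numbers are illustrative assumptions. The projection is linear in the weights, so it can be applied unchanged to a signed measure whose weights sum to 1 but are not all non-negative.

```python
import numpy as np

def categorical_projection(atoms, weights, support):
    """Project a (possibly signed) discrete measure onto a fixed support.

    Hypothetical helper, not the paper's code. Each atom's weight is split
    between its two neighbouring support points by linear interpolation;
    atoms outside the support are clipped to the nearest end point.
    Assumes `support` is evenly spaced and increasing.
    """
    atoms = np.asarray(atoms, dtype=float)
    weights = np.asarray(weights, dtype=float)
    support = np.asarray(support, dtype=float)
    k = support.size
    dz = support[1] - support[0]

    # Fractional index of each atom on the support grid.
    pos = (np.clip(atoms, support[0], support[-1]) - support[0]) / dz
    lower = np.clip(np.floor(pos).astype(int), 0, k - 1)
    upper = np.minimum(lower + 1, k - 1)
    frac = pos - lower

    out = np.zeros(k)
    # np.add.at accumulates correctly when several atoms share an index.
    np.add.at(out, lower, weights * (1.0 - frac))
    np.add.at(out, upper, weights * frac)
    return out

# Toy signed measure: total mass 1, with one negative weight.
support = np.linspace(0.0, 4.0, 5)
projected = categorical_projection(
    atoms=[0.5, 2.0, 3.25], weights=[0.7, -0.2, 0.5], support=support
)
print(projected, projected.sum())
```

In this sketch the projected weights keep the total mass of the input and can remain negative at some support points, which is why a softmax-parameterized probability vector alone cannot match such a target exactly.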
