A Distributional Analogue to the Successor Representation
Harley Wiltzer, Jesse Farebrother, Arthur Gretton, Yunhao Tang, André Barreto, Will Dabney, Marc G. Bellemare, Mark Rowland
TL;DR
The paper introduces the distributional successor measure (DSM), a distribution over occupancy measures M^π that separates transition structure from rewards in distributional RL. It derives a distributional Bellman framework for the DSM and shows that the return distribution for any reward function r can be obtained from the distribution of M^π, which enables zero-shot distributional policy evaluation for unseen rewards. The delta-model approximates the DSM with m atoms θ_i(x), learned via a two-level MMD loss with adaptive kernels. Experiments on Windy Gridworld and Pendulum demonstrate accurate return distributions and risk-sensitive policy ranking without additional data collection. Overall, the method avoids the accumulation of rollout errors and provides a practical approach to risk-aware, long-horizon decision making in continuous state spaces.
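To make the reward/transition separation concrete, here is a minimal tabular sketch, not the paper's algorithm. It assumes the occupancy of a trajectory is normalized as (1 - γ) Σ_t γ^t 1{X_t = x} with time starting at t = 0, so that the return for a reward r is the reward averaged under the occupancy, divided by (1 - γ). The function names, the three-state chain, and the truncation horizon are all hypothetical choices made for illustration.

```python
import numpy as np

def sample_occupancies(P, x0, gamma, n_traj, horizon, rng):
    """Sample truncated, normalized discounted occupancy vectors.

    Each row is one draw of the random occupancy measure of a tabular
    Markov chain with transition matrix P, started at state x0.
    """
    n_states = P.shape[0]
    occ = np.zeros((n_traj, n_states))
    for i in range(n_traj):
        x = x0
        for t in range(horizon):
            occ[i, x] += (1.0 - gamma) * gamma**t
            x = rng.choice(n_states, p=P[x])
    return occ

def return_samples(occ, r, gamma):
    """Map occupancy samples to return samples for a reward vector r."""
    return occ @ r / (1.0 - gamma)

# Hypothetical 3-state chain induced by some fixed policy.
P = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.5, 0.5],
              [0.2, 0.0, 0.8]])
gamma = 0.9
rng = np.random.default_rng(0)
occ = sample_occupancies(P, x0=0, gamma=gamma, n_traj=500, horizon=200, rng=rng)

# The same occupancy samples are reused for two different rewards,
# with no further interaction with the chain.
returns_a = return_samples(occ, np.array([1.0, 0.0, 0.0]), gamma)
returns_b = return_samples(occ, np.array([0.0, -1.0, 2.0]), gamma)
print(returns_a.mean(), returns_b.std())
```

Once the occupancy samples exist, any statistic of the return distribution (mean, variance, quantiles) for a new reward is a cheap post-processing step, which is the sense in which evaluation is zero-shot.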
Abstract
This paper contributes a new approach for distributional reinforcement learning which elucidates a clean separation of transition structure and reward in the learning process. Analogous to how the successor representation (SR) describes the expected consequences of behaving according to a given policy, our distributional successor measure (SM) describes the distributional consequences of this behaviour. We formulate the distributional SM as a distribution over distributions and provide theory connecting it with distributional and model-based reinforcement learning. Moreover, we propose an algorithm that learns the distributional SM from data by minimizing a two-level maximum mean discrepancy. Key to our method are a number of algorithmic techniques that are independently valuable for learning generative models of state. As an illustration of the usefulness of the distributional SM, we show that it enables zero-shot risk-sensitive policy evaluation in a way that was not previously possible.
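The two-level maximum mean discrepancy mentioned above can be sketched as follows. This is an illustrative estimator under stated assumptions rather than the paper's loss: it uses Gaussian kernels at both levels with fixed bandwidths (the paper's kernel choices and its adaptive-kernel scheme may differ), it uses the biased V-statistic estimate, and every function name is hypothetical. Each distribution over distributions is represented by m atoms, and each atom by n sampled states.

```python
import numpy as np

def gaussian_gram(a, b, bandwidth):
    """Gaussian kernel matrix between sample sets a (n, d) and b (k, d)."""
    sq = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * bandwidth**2))

def mmd2(a, b, bandwidth=1.0):
    """Biased estimate of the squared MMD between two sample sets."""
    return (gaussian_gram(a, a, bandwidth).mean()
            + gaussian_gram(b, b, bandwidth).mean()
            - 2.0 * gaussian_gram(a, b, bandwidth).mean())

def two_level_mmd2(atoms_p, atoms_q, inner_bw=1.0, outer_bw=1.0):
    """Squared MMD between two sets of atoms, each of shape (m, n, d).

    The outer kernel compares two atoms through their inner squared MMD,
    so it acts on state distributions rather than on single states.
    """
    def outer_gram(A, B):
        d2 = np.array([[mmd2(a, b, inner_bw) for b in B] for a in A])
        return np.exp(-np.maximum(d2, 0.0) / (2.0 * outer_bw**2))

    return (outer_gram(atoms_p, atoms_p).mean()
            + outer_gram(atoms_q, atoms_q).mean()
            - 2.0 * outer_gram(atoms_p, atoms_q).mean())

rng = np.random.default_rng(0)
atoms_p = rng.normal(size=(3, 5, 2))   # 3 atoms, 5 state samples each, 2-D states
atoms_q = atoms_p + 5.0                # same shapes, shifted state distributions
print(two_level_mmd2(atoms_p, atoms_p), two_level_mmd2(atoms_p, atoms_q))
```

The inner level measures how far apart two state distributions are, and the outer level measures how far apart two collections of such distributions are, which matches the view of the distributional SM as a distribution over distributions.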
