A Distributional Analogue to the Successor Representation

Harley Wiltzer; Jesse Farebrother; Arthur Gretton; Yunhao Tang; André Barreto; Will Dabney; Marc G. Bellemare; Mark Rowland

TL;DR

The paper introduces the distributional successor measure (DSM), a distribution over random occupancy measures $M^\pi$, to separate transition structure from rewards in distributional RL. It derives a distributional Bellman framework and shows that the return distribution for any reward function $r$ can be obtained from the distribution of $M^\pi$, which enables zero-shot distributional policy evaluation for unseen rewards. The DSM is approximated by a $\delta$-model with $m$ atoms $\theta_i(x)$, learned by minimizing a two-level MMD loss with adaptive kernels. Experiments on Windy Gridworld and Pendulum show accurate return distributions and risk-sensitive policy ranking without additional data collection. Overall, the method avoids the accumulation of rollout errors and provides a practical approach to risk-aware, long-horizon decision making in continuous state spaces.

Abstract

This paper contributes a new approach for distributional reinforcement learning which elucidates a clean separation of transition structure and reward in the learning process. Analogous to how the successor representation (SR) describes the expected consequences of behaving according to a given policy, our distributional successor measure (SM) describes the distributional consequences of this behaviour. We formulate the distributional SM as a distribution over distributions and provide theory connecting it with distributional and model-based reinforcement learning. Moreover, we propose an algorithm that learns the distributional SM from data by minimizing a two-level maximum mean discrepancy. Key to our method are a number of algorithmic techniques that are independently valuable for learning generative models of state. As an illustration of the usefulness of the distributional SM, we show that it enables zero-shot risk-sensitive policy evaluation in a way that was not previously possible.
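To make the phrase "two-level maximum mean discrepancy" concrete, here is a minimal sketch in plain Python. It is not the paper's loss: the Gaussian inner kernel, the exponentiated-MMD outer kernel, the fixed bandwidths, and all function names are assumptions chosen for illustration. Each "atom" is a list of state samples; the inner MMD compares two atoms, and the outer MMD compares two collections of atoms using a kernel built from the inner MMD.

```python
import math

def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian kernel between two states, each a sequence of floats (assumed choice)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def mean_kernel(xs, ys, kernel):
    """Average kernel value over all pairs drawn from xs and ys."""
    return sum(kernel(x, y) for x in xs for y in ys) / (len(xs) * len(ys))

def mmd_squared(xs, ys, kernel):
    """Biased (V-statistic) estimate of the squared MMD between two sample sets."""
    return (mean_kernel(xs, xs, kernel)
            + mean_kernel(ys, ys, kernel)
            - 2.0 * mean_kernel(xs, ys, kernel))

def outer_kernel(atom_p, atom_q, outer_bandwidth=1.0):
    """Kernel between two atoms (sets of state samples), built on the inner MMD."""
    inner = mmd_squared(atom_p, atom_q, gaussian_kernel)
    return math.exp(-inner / (2.0 * outer_bandwidth ** 2))

def two_level_mmd_squared(atoms_a, atoms_b):
    """Squared MMD between two collections of atoms under the outer kernel."""
    return mmd_squared(atoms_a, atoms_b, outer_kernel)

# Hypothetical toy data: each collection holds two atoms of 1-D state samples.
atoms_a = [[(0.0,), (0.1,)], [(1.0,), (1.1,)]]
atoms_b = [[(0.0,), (0.1,)], [(3.0,), (3.2,)]]

same = two_level_mmd_squared(atoms_a, atoms_a)
different = two_level_mmd_squared(atoms_a, atoms_b)
print(same, different)
```

In a learning setting the second collection would be a bootstrapped target and the first would be the model's atoms; the sketch only evaluates the discrepancy and does not perform any optimization.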

Paper Structure (33 sections, 11 theorems, 47 equations, 9 figures, 1 table, 1 algorithm)

Introduction
Background
Successor Measure
Distributional Policy Evaluation
The Distributional SM
Random Occupancy Measures
Distributional SM Bellman Equations
Representing and Learning the DSM
Representation by $\delta$-models
Learning from Samples
Practical Training of $\delta$-models
$n$-step Bootstrapping
Kernel Selection
Experimental Results
Related Work
...and 18 more sections

Key Result

Proposition 3.1

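Let $M^\pi$ denote a random discounted state-occupancy measure for a given policy $\pi$. For any deterministic reward function $r:\mathcal{X}\to\mathbb{R}$, we have

$$G^\pi_r(x) \overset{\mathcal{L}}{=} (1-\gamma)^{-1}\,\mathbb{E}_{X'\sim M^\pi(\cdot\mid x)}\left[r(X')\right],$$

where $G^\pi_r(x)$ is the random discounted return from state $x$ under reward $r$ and $\gamma$ is the discount factor. Note that the right-hand side is a random variable, since $M^\pi(\cdot\mid x)$ itself is a random distribution.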
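The link between a random occupancy measure and a return can be checked numerically on a single trajectory. The sketch below assumes the occupancy measure of a trajectory $(X_t)$ is normalized as $(1-\gamma)\sum_t \gamma^t \delta_{X_t}$ with discount factor $\gamma$, so that the discounted return equals $(1-\gamma)^{-1}$ times the reward averaged under that measure. The three-state chain, the reward vector, and all names are hypothetical, and the trajectory is truncated at a finite horizon, so both sides are truncated in the same way.

```python
import random

def sample_trajectory(transitions, start, horizon, rng):
    """Roll out `horizon` states from a tabular Markov chain (policy already folded in)."""
    states, state = [], start
    for _ in range(horizon):
        states.append(state)
        next_states, probs = transitions[state]
        state = rng.choices(next_states, weights=probs)[0]
    return states

def occupancy_measure(states, gamma, num_states):
    """Discounted visitation weights (1 - gamma) * gamma**t accumulated per state."""
    occ = [0.0] * num_states
    for t, s in enumerate(states):
        occ[s] += (1.0 - gamma) * gamma ** t
    return occ

def discounted_return(states, reward, gamma):
    """Discounted sum of rewards along the trajectory."""
    return sum(gamma ** t * reward[s] for t, s in enumerate(states))

def return_from_occupancy(occ, reward, gamma):
    """Return recovered from the occupancy measure alone."""
    return sum(m * r for m, r in zip(occ, reward)) / (1.0 - gamma)

# Hypothetical three-state chain and reward function.
GAMMA = 0.9
TRANSITIONS = {
    0: ([0, 1], [0.5, 0.5]),
    1: ([1, 2], [0.3, 0.7]),
    2: ([0, 2], [0.2, 0.8]),
}
REWARD = [0.0, 1.0, -2.0]

rng = random.Random(0)
trajectory = sample_trajectory(TRANSITIONS, start=0, horizon=200, rng=rng)
occ = occupancy_measure(trajectory, GAMMA, num_states=3)
g_direct = discounted_return(trajectory, REWARD, GAMMA)
g_from_occ = return_from_occupancy(occ, REWARD, GAMMA)
print(abs(g_direct - g_from_occ) < 1e-9)
```

Because the occupancy measure does not depend on the reward, the same `occ` can be reused with any other reward vector, which is the mechanism behind zero-shot evaluation.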

Figures (9)

Figure 1: Illustration of the standard and distributional successor measure (SM) in a T-Maze MDP, for a policy that moves to the fork and goes backwards, right, or left, with probabilities $\frac{1}{6},\frac{1}{2},\frac{1}{3}$. Left: The distributional SM $\daleth^{\pi}$ (top) consisting of atoms $\theta_1, \theta_2, \theta_3$ depicting the occupancy measures (probability distributions) corresponding to the distinct behaviors exhibited by the policy, and the SM $\Psi^\pi$ (bottom) $\Psi^\pi = \frac{\theta_1}{6} + \frac{\theta_2}{2} + \frac{\theta_3}{3}$. Right: Zero-shot distributional policy evaluation (top) with $\daleth^{\pi}$ and zero-shot policy evaluation (bottom) with $\Psi^\pi$.
Figure 2: The components of a $\delta$-model (see "Representation by $\delta$-models"), and the kernels and distances involved in training them (see "Learning from Samples").
Figure 3: Distributional successor measure predictions in Windy Gridworld. (a): Figures in the left column show the model atoms predicted by the distributional SM (distinguished by color) and by an ensemble of $\gamma$-models. Figures in the right column show the mean over distributional SM model atoms and the SM itself. (b): Distributional SM predictions of return statistics on held-out reward functions for two policies, $\pi_1,\pi_2$. For each reward function, the distributional SM correctly ranks policies with respect to both mean and CVaR.
Figure 4: Top: Kernel density estimate of distributional SM. Red dot represents the standard SR. Bottom: Kernel density estimates of return distributions, obtained via distributional SM. Vertical lines represent expected return, obtained from standard SR.
Figure 5: Monte Carlo estimation of the distributional SM at states $x_0$, $x_1$, and $x_2$, in a three-state MDP. Each distribution is supported on a copy of the fractal Sierpiński triangle. Red dot represents the standard SR.
...and 4 more figures
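The T-Maze setting of Figure 1 can be illustrated with a small numerical sketch. Only the mixture weights $\frac{1}{6},\frac{1}{2},\frac{1}{3}$ come from the caption; the four-state layout, the atom values, the reward, and the discount factor are hypothetical. Each atom is treated as a normalized occupancy measure, so its return is the reward averaged under the atom and scaled by $(1-\gamma)^{-1}$. The atoms yield a full return distribution (from which a risk measure such as CVaR can be read), while the mixture $\Psi^\pi$ yields only the mean.

```python
GAMMA = 0.9                      # assumed discount factor
WEIGHTS = [1 / 6, 1 / 2, 1 / 3]  # backwards, right, left (from the Figure 1 caption)

# Hypothetical atoms over states [start, fork, left_arm, right_arm]; each sums to 1.
ATOMS = [
    [0.7, 0.3, 0.0, 0.0],  # theta_1: goes backwards
    [0.1, 0.2, 0.0, 0.7],  # theta_2: goes right
    [0.1, 0.2, 0.7, 0.0],  # theta_3: goes left
]
REWARD = [0.0, 0.0, 1.0, -1.0]   # hypothetical held-out reward function

def atom_return(atom, reward, gamma):
    """Return associated with one occupancy-measure atom."""
    return sum(m * r for m, r in zip(atom, reward)) / (1.0 - gamma)

def cvar(outcomes, probs, alpha):
    """Mean of the worst alpha-fraction (lower tail) of a discrete distribution."""
    remaining, total = alpha, 0.0
    for outcome, prob in sorted(zip(outcomes, probs)):
        take = min(prob, remaining)
        total += take * outcome
        remaining -= take
        if remaining <= 0.0:
            break
    return total / alpha

# Zero-shot distributional evaluation: one return per atom, weighted by WEIGHTS.
returns = [atom_return(atom, REWARD, GAMMA) for atom in ATOMS]
mean_from_atoms = sum(w * g for w, g in zip(WEIGHTS, returns))
cvar_half = cvar(returns, WEIGHTS, alpha=0.5)

# Zero-shot mean evaluation: the SM is the weighted mixture of the atoms.
sm = [sum(w * atom[i] for w, atom in zip(WEIGHTS, ATOMS)) for i in range(4)]
mean_from_sm = atom_return(sm, REWARD, GAMMA)

print(returns, mean_from_atoms, mean_from_sm, cvar_half)
```

The two mean estimates agree, but only the atom-based computation exposes the spread of outcomes needed to rank policies by a risk-sensitive criterion.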

Theorems & Definitions (21)

Definition 3.1: Random occupancy measure
Proposition 3.1
Remark 3.2
Definition 3.3: Distributional successor measure
Proposition 3.3
Proposition 3.3: Contractivity of $\mathcal{T}^\pi$
Corollary 3.3: Convergent Dynamic Programming
Proposition 2.0
proof
Proposition 2.0
...and 11 more
