Hierarchical Successor Representation for Robust Transfer

Changmin Yu, Máté Lengyel

TL;DR

This paper proposes the Hierarchical Successor Representation (HSR), a policy-agnostic hierarchical map that bridges model-free optimality and model-based flexibility, and that can drive efficient exploration which scales to large, procedurally generated environments.

Abstract

The successor representation (SR) provides a powerful framework for decoupling predictive dynamics from rewards, enabling rapid generalisation across reward configurations. However, the classical SR is limited by its inherent policy dependence: policies change due to ongoing learning, environmental non-stationarities, and changes in task demands, making established predictive representations obsolete. Furthermore, in topologically complex environments, SRs suffer from spectral diffusion, leading to dense and overlapping features that scale poorly. Here we propose the Hierarchical Successor Representation (HSR) for overcoming these limitations. By incorporating temporal abstractions into the construction of predictive representations, HSR learns stable state features which are robust to task-induced policy changes. Applying non-negative matrix factorisation (NMF) to the HSR yields a sparse, low-rank state representation that facilitates highly sample-efficient transfer to novel tasks in multi-compartmental environments. Further analysis reveals that HSR-NMF discovers interpretable topological structures, providing a policy-agnostic hierarchical map that effectively bridges model-free optimality and model-based flexibility. Beyond providing a useful basis for task-transfer, we show that HSR's temporally extended predictive structure can also be leveraged to drive efficient exploration, effectively scaling to large, procedurally generated environments.
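The decoupling of predictive dynamics from rewards described above can be illustrated with the classical SR, together with a non-negative factorisation of the resulting matrix. The sketch below is not the authors' HSR construction: it assumes a toy random walk on an 8-state ring, uses the closed-form classical SR $M = (I - \gamma P)^{-1}$ as a stand-in predictive matrix, and swaps in plain multiplicative-update NMF (Lee and Seung style) for whichever NMF solver the paper uses. The function names, the environment, and the rank are all hypothetical choices made for illustration.

```python
import numpy as np


def ring_random_walk(n_states: int) -> np.ndarray:
    """Row-stochastic transition matrix of a random walk on a ring (toy assumption)."""
    P = np.zeros((n_states, n_states))
    for s in range(n_states):
        P[s, (s - 1) % n_states] += 0.5
        P[s, (s + 1) % n_states] += 0.5
    return P


def successor_representation(P: np.ndarray, gamma: float) -> np.ndarray:
    """Classical SR in closed form: M = sum_t gamma^t P^t = (I - gamma P)^{-1}."""
    return np.linalg.inv(np.eye(P.shape[0]) - gamma * P)


def nmf_multiplicative(M, rank, n_iter=500, seed=0, eps=1e-9):
    """Multiplicative-update NMF, M ~= W @ H with W, H >= 0 (illustrative solver)."""
    rng = np.random.default_rng(seed)
    W = rng.random((M.shape[0], rank)) + eps
    H = rng.random((rank, M.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ M) / (W.T @ W @ H + eps)
        W *= (M @ H.T) / (W @ H @ H.T + eps)
    return W, H


GAMMA = 0.9
P = ring_random_walk(8)
M = successor_representation(P, GAMMA)

# The same predictive matrix M gives values for two different reward vectors.
r1 = np.zeros(8)
r1[2] = 1.0
r2 = np.zeros(8)
r2[5] = 1.0
v1, v2 = M @ r1, M @ r2

# Low-rank non-negative features of M (rows of W act as state features).
W, H = nmf_multiplicative(M, rank=3)
rel_err = np.linalg.norm(M - W @ H) / np.linalg.norm(M)
print("relative NMF reconstruction error:", rel_err)
```

Because `M` depends on `P`, and `P` depends on the policy, any change of policy changes `M`; this is the policy dependence that the abstract identifies as the limitation of the classical SR.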

Paper Structure (16 sections, 1 theorem, 17 equations, 8 figures, 2 algorithms)

Key Result

Theorem 3.1

Let $\mathcal{T}$ be the HSR Bellman operator defined by Equation \ref{eq:hsr_bellman}. For any discount factor $\gamma < 1$ and option durations $\tau \geq 1$, $\mathcal{T}$ is a contraction mapping with respect to the max-norm for any $\mathcal{M}$ and $\mathcal{M}'$ (the proof can be found in Section \ref{sec:bellman_contraction_proof}).
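Spelled out in its standard form, contraction with respect to the max-norm means that applying $\mathcal{T}$ shrinks the distance between any two candidate representations by a fixed factor. The modulus $\kappa$ below is a symbol introduced here for illustration; its exact value is not given in this excerpt and would be expected to depend on $\gamma$ and $\tau$. The second line is the usual consequence via the Banach fixed-point theorem, with $\mathcal{M}^\ast$ denoting the unique fixed point and $\mathcal{M}_0$ an arbitrary initialisation.

```latex
\[
  \|\mathcal{T}\mathcal{M} - \mathcal{T}\mathcal{M}'\|_\infty
  \;\le\; \kappa \,\|\mathcal{M} - \mathcal{M}'\|_\infty,
  \qquad \kappa \in [0, 1),
\]
\[
  \|\mathcal{T}^{k}\mathcal{M}_0 - \mathcal{M}^\ast\|_\infty
  \;\le\; \kappa^{k}\,\|\mathcal{M}_0 - \mathcal{M}^\ast\|_\infty
  \quad \text{for all } k \ge 0 .
\]
```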

Figures (8)

  • Figure 1: Temporal abstraction yields hierarchical successor representation. a. Schematic of the computational process underlying the construction of hierarchical successor representations, and the corresponding low-dimensional bases obtained through NMF ($\boldsymbol{\mathsf{\Phi}}$) and singular value decomposition (SVD; $\boldsymbol{\mathsf{V}}$). b. Exemplar pretraining regimes ($G$). c. SR matrix corresponding to the random-walk policy (RW-SR). Note that state indices were permuted to respect the topological structure of the four-room environment (indicated by gray dashed lines). d. Principal eigenvectors (ranked by eigenvalues) of the RW-SR matrix (top), as well as their corresponding eigenoption-specific policies (bottom). e. Expected SR matrix (eSR; $\bar{\boldsymbol{\mathsf{M}}}$). f. Expected HSR matrix (eHSR; $\bar{\boldsymbol{\mathcal{M}}}$). g. Two-dimensional t-SNE projections (van der Maaten and Hinton, 2008) of the row space of RW-SR (left), eSR (middle), and eHSR (right). h. Principal NMF basis vectors (ranked by basis norm) of the eHSR matrix. i. Principal eigenvectors of the eSR matrix. j. Same as i, but for the RW-SR matrix. Note that all presented basis vectors were normalised by their corresponding maximum absolute values, hence leaving their signs invariant.
  • Figure 2: HSR provides a stable state representation and enables sample-efficient transfer across tasks with shared transition dynamics. a. Exemplar four-room environment, with a fixed start location and two different goal locations. b. Training curves (number of steps to reach the goal location; mean $\pm$ s.e.) for Q-learning agents with linear function approximation, given different state representations (left: one-hot representation; middle: rows of SR matrices; right: rows of HSR matrices). All agents were first trained to reach $G_1$, and subsequently transferred to the new task with goal location $G_2$. All state representations (apart from the one-hot representation) were trained simultaneously with the value function. Dashed horizontal red and magenta lines indicate the optimal number of steps to reach $G_1$ ($10$) and $G_2$ ($18$) from the shared start state. c. Number of training episodes to reach optimal performance in the $G_1$ and $G_2$ tasks, for all agents in b (set to $200$ if an agent fails to reach optimal performance within $200$ episodes). d. Transfer efficiency (normalised ratio between the number of training episodes to reach optimal performance in the $G_1$ and $G_2$ tasks, see text) for SR- and HSR-based agents (two-sided two-sample t-test; $p = 0.008$, $\text{df} = 38$). e. SR matrices (in log-scale for visual clarity) after the corresponding agents were trained to reach optimal performance in the $G_1$ (left) and $G_2$ (right) tasks. Note that rows and columns of SR matrices are permuted to restore the local topological structure of the environment (Figure \ref{fig:four_room_motivation}b). f. Same as e, but for HSR matrices. Note that diagonal elements are omitted for visual clarity. g. Degree of change in predictive representation $\left(\frac{\|M_1 - M_2\|^2_F}{\|M_1\|^2_F}\right)$ after agents were trained in the $G_1$ and $G_2$ tasks, for SR and HSR matrices, respectively (two-sided two-sample t-test; $p < 0.001$, $\text{df} = 38$). h. Reconstruction $R^2$ scores of ground-truth optimal value functions (computed via dynamic programming) for the $G_1$ and $G_2$ tasks, given a varying number of SR/HSR basis vectors, after agents were trained to reach optimal performance in the $G_1$ task.
  • Figure 3: NMF basis of HSR supports sample-efficient transfer. a. Training curves (left) and number of training episodes to reach optimal performance (right) in $G_1$ tasks (Figure \ref{fig:four_room_row_features_main}a) for Q-learning agents with linear function approximation, given different low-dimensional bases as state representations. All agents were assumed to have received the pretraining necessary for constructing the base matrices (SR/HSR) before the corresponding low-dimensional bases were extracted. Gray dotted line indicates the optimal number of steps to reach $G_1$. b. Number of training episodes required to reach optimal performance for all agents in a (two-sided Wilcoxon signed-rank test; $\text{RW-SR}_{\text{SVD}}$ vs $\text{HSR}_{\text{NMF}}$: $p = 7.91\times 10^{-17}$; $\text{ESR}_{\text{SVD}}$ vs $\text{HSR}_{\text{NMF}}$: $p = 2.93\times 10^{-9}$; $\text{HSR}_{\text{SVD}}$ vs $\text{HSR}_{\text{NMF}}$: $p = 6.73\times10^{-10}$; $\text{HSR}_{\text{NMF}}$ vs $\text{HSR}_{\text{Row}}$: $p = 4.59\times10^{-18}$; $N_1 = 20$ and $N_2 = 20$ for all tests). Gray dashed horizontal line indicates the same number for the baseline agent with one-hot state encoding (shaded area indicates s.e.).
  • Figure 4: HSR-NMF bases yield a sparse, robust, and interpretable state representation. a. Example trajectory in the four-room environment. b. Activation (normalised) of all bases at each timestep along the example trajectory for $\text{eSR}_{\text{SVD}}$, $\text{eSR}_{\text{NMF}}$, $\text{HSR}_{\text{SVD}}$, $\text{HSR}_{\text{NMF}}$ (from left to right). Gray numbers below the rightmost panel indicate which room the corresponding trajectory segment is in. c. Reconstruction mean-squared error of predictive representations given the corresponding low-dimensional features, as a function of the number of bases. d. Reconstruction $R^2$ score of optimal value functions with respect to randomly selected goal locations, given different state features with varying basis size. e. Relative bottleneck activation (mean activation at bottleneck states / mean activation at non-bottleneck states) of low-dimensional features of SR and HSR.
  • Figure 5: Hierarchical temporal abstraction enables scalable intrinsically motivated exploration. a. Exemplar procedurally generated random maze environment. b. Learning curves (mean $\pm$ s.e.) of different agents (see main text) in terms of pure exploration (in the absence of extrinsic reward; left) and goal-directed navigation (with non-zero reward only at randomly selected goal locations; right). c. Asymptotic state coverage (after $10^5$ interaction steps; mean $\pm$ s.e.) for SR-SPIE and HSR-SPIE agents, as a function of maze size.
  • ...and 3 more figures
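
Two of the summary statistics named in the captions above are simple enough to sketch directly: the degree of change in predictive representation (Figure 2g) and the relative bottleneck activation (Figure 4e). The following is a minimal sketch under stated assumptions, not the authors' code: the function names are hypothetical, the toy matrices are made up, and for the bottleneck ratio it is assumed that features are non-negative and that the mean is taken jointly over states and basis dimensions, which the caption does not specify.

```python
import numpy as np


def relative_change(M1: np.ndarray, M2: np.ndarray) -> float:
    """Degree of change ||M1 - M2||_F^2 / ||M1||_F^2, as written in the Figure 2g caption."""
    num = np.linalg.norm(M1 - M2, "fro") ** 2
    den = np.linalg.norm(M1, "fro") ** 2
    return float(num / den)


def relative_bottleneck_activation(features: np.ndarray, is_bottleneck: np.ndarray) -> float:
    """Mean activation at bottleneck states divided by mean activation elsewhere.

    features: (n_states, n_bases) array, assumed non-negative (e.g. NMF features).
    is_bottleneck: boolean mask of length n_states (hypothetical labelling of doorways).
    """
    at_bottleneck = features[is_bottleneck].mean()
    elsewhere = features[~is_bottleneck].mean()
    return float(at_bottleneck / elsewhere)


# Toy inputs, chosen only to make the arithmetic easy to follow.
M_before = np.eye(3)
M_after = 2.0 * np.eye(3)
change = relative_change(M_before, M_after)

toy_features = np.array([[1.0, 1.0], [3.0, 3.0], [1.0, 1.0]])
toy_mask = np.array([False, True, False])
ratio = relative_bottleneck_activation(toy_features, toy_mask)
print(change, ratio)
```

A smaller `relative_change` indicates a representation that moved less between the two tasks, and a `relative_bottleneck_activation` above one indicates features that respond more strongly at bottleneck states than elsewhere.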

Theorems & Definitions (2)

  • Theorem 3.1: Contraction of HSR Bellman Operator
  • proof