Predictive representations: building blocks of intelligence

Wilka Carvalho; Momchil S. Tomov; William de Cothi; Caswell Barry; Samuel J. Gershman

TL;DR

This paper argues that predictive representations, especially the successor representation (SR) and its extensions, provide an efficient and flexible foundation for intelligent behavior. It surveys three core constructs (the SR, successor models, and successor features) and shows how they trade off planning flexibility, sample efficiency, and scalability to high-dimensional spaces. The work connects reinforcement learning theory to neuroscience and cognitive science, illustrating how predictive representations manifest in the hippocampus, spatial navigation, replay, and memory, while detailing learning algorithms, transfer, and multi-task capabilities. Overall, predictive representations enable rapid adaptation to changing rewards, scalable transfer across tasks, and principled integration of episodic memory with decision making, making them strong candidates as building blocks for general intelligence. The practical implications span AI applications in exploration, transfer, hierarchical reinforcement learning, and multi-agent settings, as well as insights into brain function and cognitive processes.

Abstract

Adaptive behavior often requires predicting future events. The theory of reinforcement learning prescribes what kinds of predictive representations are useful and how to compute them. This paper integrates these theoretical ideas with work on cognition and neuroscience. We pay special attention to the successor representation (SR) and its generalizations, which have been widely applied both as engineering tools and models of brain function. This convergence suggests that particular kinds of predictive representations may function as versatile building blocks of intelligence.

Paper Structure (70 sections, 69 equations, 9 figures, 1 table)

Introduction
Theory
The reinforcement learning problem
Classical solution methods
The successor representation
Successor models: a probabilistic perspective on the SR
Successor features: a feature-based generalization of the SR
Generalized policy improvement: adaptively combining policies
Option Keyboard: chaining together policies
Summary
Practical learning algorithms and associated challenges
Learning successor features
Discovering cumulants
Estimating successor features
Learning an estimator that can generalize across policies
...and 55 more sections

Figures (9)

Figure 1: Algorithmic solutions to the RL problem. An agent solving a three-armed maze (bottom) can adopt different classes of strategies (top). Model-based strategies (left) learn an internal model of the environment, including the transition function ($T$), the reward function ($R$), and (optionally) the features ($\boldsymbol{\phi}$). At decision time, the agent can run forward simulations to predict the outcomes of different actions. Model-free strategies (middle) learn action values ($Q$) and/or a policy ($\pi$). At decision time, the agent can consult the cached action values and/or policy in the current state. Strategies relying on predictive representations (right) learn the successor representation (SR) matrix ($\mathbf{M}$) mapping states to future states and/or the successor features ($\boldsymbol{\psi}$) mapping states to future features, as well as the reward function ($R$). At decision time, the agent can consult the cached predictions and cross-reference them with its task (specified by the reward function) to choose an action.
Figure 2: Three kinds of predictive representations: the successor representation (§\ref{sec:sr}), the successor model (§\ref{sec:successor-models}), and successor features (§\ref{sec:sf}). Their computations are summarized in Table \ref{table:sr-summary}. Each of these predictive representations describes a state by a prediction of what will happen when a policy $\pi$ is followed. The successor representation describes how much every state will be visited in the near future when beginning at state $s$. One limitation is that it does not scale well to large state-spaces, since it is impractical to maintain predictions about all states. Successor models circumvent this challenge by framing learning as a density estimation problem. This enables scaling to high-dimensional state- and action-spaces (including continuous spaces) with amortized learning procedures (§\ref{sec:learning-horizon}). Successor features are another way to circumvent the challenge of representing large state-spaces: they describe states with a shared set of state-features and predict how much each feature will be experienced. Both successor models and successor features have pros and cons. Successor models open up new possibilities like supporting temporally abstract sampling of future states under a policy. Additionally, algorithms for learning successor models typically subsume learning of state-features, whereas successor features typically need a separate mechanism for learning state-features. On the other hand, successor features are easier to formulate and more readily enable stitching together policies concurrently (§\ref{sec:sf-gpi}) and sequentially (§\ref{sec:ok}) in time, though there is progress on doing this with successor models (see §\ref{sec:transfer-advances}).
Figure 3: The successor representation (SR). (a) A schematic of an environment where the agent is a red box at state $s_{13}$ and the goal is a green box at state $s_5$. In general, an SR $M^{\pi}(s, \tilde{s})$ (Eq. \ref{eq:sr}) describes the discounted state occupancy for state $\tilde{s}$ when beginning at state $s$ and following policy $\pi$. In panels (b-c), we showcase $M^{\pi}(s_{13}, \tilde{s})$ for a random policy and an optimal policy. (b) The SR under a random policy measures high state occupancy near the agent's current state (e.g., $M^{\pi}(s_{13}, s_{14}) = 5.97$) and low state occupancy at points further away from the agent (e.g., $M^{\pi}(s_{13}, s_{12}) = 0.16$). (c) The SR under the optimal policy has the highest state occupancy along the shortest path to the goal (e.g., here $M^{\pi}(s_{13}, s_{12}) = 0.66$), fading as we get further from the current state. In contrast to a random policy, states not along that path have $0$ occupancy (e.g., $M^{\pi}(s_{13}, s_{19}) = 0.0$). Once we know a reward function, we can efficiently evaluate both policies (Eq. \ref{eq:sr-value}). (d) An example reward function that has a cost of $-0.1$ for each state except the goal state, where the reward is $1$. The SR allows us to efficiently compute (e) the value function under a random policy and (f) the value function under the optimal policy.
Figure 4: The Successor Model (SM). A cartoon schematic of a robot leg that can hop forward. Left: a single-step model can only compute likelihoods for states at the next time-step. Right: multi-step successor models can compute likelihoods for states over some horizon into the future. One key difference between the SM and the SR is that the SM defines a valid probability distribution. This means that we can leverage density estimation techniques for learning it over continuous state- and action-spaces. Additionally, as this figure suggests, we can use it to sample potentially distal states (see §\ref{sec:mbrl}). Adapted with permission from janner2020gamma.
Figure 5: A schematic of successor features (SFs; §\ref{sec:sf}) and generalized policy improvement (GPI; §\ref{sec:sf-gpi}). Note that we use the shorthand $\boldsymbol{\phi}_t = \boldsymbol{\phi}(s_t)$ to represent the "state-features" that describe what is visible to the agent at time $t$. $\pi$ corresponds to policies that the agent knows how to perform. (a) Examples of SFs (Eq. \ref{eq:sf-def}) for the "open drawer" and "open fridge" policies. In this hypothetical scenario, the state-features that the agent holds describe whether an apple, milk, fork, or knife are present. Beginning from the first time-step, the SFs for these policies encode predictions of which features will be present when the policies are executed: apple and milk for the open fridge policy, and fork and knife for the open drawer policy. (b) The agent can re-use these known policies with GPI (Eq. \ref{eq:sf-gpi}). When given a new task, say "get milk", it can leverage the SFs of the policies it knows to decide which behavior will enable it to get milk. In this example, the policy for opening the fridge will also lead to milk. The agent selects actions with GPI by computing Q-values for each known behavior as the dot-product between the current task vector $\mathbf{w}$ and each known SF. The highest Q-value is then used to select actions. If the agent wants to execute the option keyboard (OK; §\ref{sec:ok}), it can adaptively set $\mathbf{w}$ based on the current state. For example, at some states the agent may want to pursue getting milk, while in others it may want to pursue getting a fork. Adapted from carvalho2023combining with permission.
...and 4 more figures
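
The captions for Figures 3 and 5 describe two computations: evaluating a policy by combining cached predictions with a reward specification, and choosing actions with GPI by taking the highest Q-value across known policies. The sketch below illustrates both on a toy problem. It is not the paper's implementation, and everything in it is an assumption made for illustration: the five-state corridor, the function names, the one-hot state-features (under which successor features coincide with an action-conditioned SR), the convention that features are counted from the next state onward, and the use of plain fixed-point iteration in place of any learning algorithm.

```python
# Hypothetical toy corridor: states 0..4, actions 0 = left, 1 = right.
N_STATES = 5
GAMMA = 0.9
ACTIONS = (0, 1)


def step(s, a):
    """Deterministic transition: move left or right, clipped at the walls."""
    return max(0, s - 1) if a == 0 else min(N_STATES - 1, s + 1)


def phi(s):
    """One-hot state-features (an assumption); SFs then reduce to the SR."""
    return [1.0 if i == s else 0.0 for i in range(N_STATES)]


def successor_features(policy, n_iters=200):
    """Iterate psi(s, a) = phi(s') + gamma * psi(s', policy(s')) to a fixed point."""
    psi = {(s, a): [0.0] * N_STATES for s in range(N_STATES) for a in ACTIONS}
    for _ in range(n_iters):
        new_psi = {}
        for (s, a) in psi:
            s_next = step(s, a)
            bootstrap = psi[(s_next, policy(s_next))]
            new_psi[(s, a)] = [f + GAMMA * b for f, b in zip(phi(s_next), bootstrap)]
        psi = new_psi
    return psi


def q_value(psi_sa, w):
    """Q-value as the dot-product between cached predictions and the task vector w."""
    return sum(p * wi for p, wi in zip(psi_sa, w))


def gpi_action(psis, s, w):
    """GPI: pick the action whose best Q-value across the known policies is highest."""
    return max(ACTIONS, key=lambda a: max(q_value(psi[(s, a)], w) for psi in psis))


def always_left(s):
    return 0


def always_right(s):
    return 1


# Predictions are computed once per known policy, before any task is specified.
psis = [successor_features(always_left), successor_features(always_right)]

# Two tasks expressed only as reward weights over the features.
w_right_goal = [0.0, 0.0, 0.0, 0.0, 1.0]  # reward for being at state 4
w_left_goal = [1.0, 0.0, 0.0, 0.0, 0.0]   # reward for being at state 0

print(gpi_action(psis, 2, w_right_goal))
print(gpi_action(psis, 2, w_left_goal))
```

In this sketch the cached predictions in `psis` are reused unchanged for both tasks; only the weight vector differs. That mirrors the point made in the captions, that a new reward function can be evaluated without relearning the predictions.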
