Which Features are Best for Successor Features?

Yann Ollivier

TL;DR

This work addresses how to choose base features for universal successor features (SFs) to maximize downstream performance in zero-shot RL. It introduces three reward models (Gaussian, goal-reaching, and scattered rewards) and derives an objective-based criterion for optimal features, valid without assuming the downstream tasks lie in the linear span of the features. The main result identifies the optimal features as the largest eigenfunctions of a specific operator involving the inverse of the regularized Bellman operator, with special cases depending on the regime: for γ = 1 in a deterministic environment, the largest eigenfunctions of $\Delta^{-1}+(\Delta^{-1})^{\ast}$; for γ = 0, the smallest eigenfunctions of $P_{\pi_0}^{\ast}P_{\pi_0}$; and, in general environments, the top eigenfunctions of a reward-advantage matrix. These findings explain why Laplacian-based and forward-backward representations often perform well in practice, while clarifying their limitations. The paper also shows that the average Bellman gap is uninformative for selecting SF features and discusses the role of KL-regularized policy gradients in linking theory to practice, while noting the absence of explicit algorithms to compute the optimal features. Overall, the results provide a principled foundation for selecting SF bases and offer insights into the relationship between spectral properties and zero-shot RL performance.

Abstract

In reinforcement learning, universal successor features (SFs) are a way to provide zero-shot adaptation to new tasks at test time: they provide optimal policies for all downstream reward functions lying in the linear span of a set of base features. But it is unclear what constitutes a good set of base features that could be useful for a wide set of downstream tasks beyond their linear span. Laplacian eigenfunctions (the eigenfunctions of $\Delta+\Delta^\ast$ with $\Delta$ the Laplacian operator of some reference policy and $\Delta^\ast$ that of the time-reversed dynamics) have been argued to play a role, and offer good empirical performance. Here, for the first time, we identify the optimal base features based on an objective criterion of downstream performance, in a non-tautological way without assuming the downstream tasks are linear in the features. We do this for three generic classes of downstream tasks: reaching a random goal state, dense random Gaussian rewards, and random "scattered" sparse rewards. The features yielding optimal expected downstream performance turn out to be the same for these three task families. They do not coincide with Laplacian eigenfunctions in general, though they can be expressed from $\Delta$: in the simplest case (deterministic environment and decay factor $\gamma$ close to $1$), they are the eigenfunctions of $\Delta^{-1}+(\Delta^{-1})^\ast$. We obtain these results under an assumption of large behavior cloning regularization with respect to a reference policy, a setting often used for offline RL. Along the way, we get new insights into KL-regularized natural policy gradient, and into the lack of SF information in the norm of Bellman gaps.
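To make the two spectral objects in the abstract concrete, here is a minimal NumPy sketch that computes both on a toy problem. It is an illustration under stated assumptions, not the paper's algorithm (the summary notes that no explicit algorithm is given). The assumptions are: a small random state-level Markov chain `P` stands in for the reference-policy dynamics; the Laplacian is taken as `Delta = Id - gamma * P`; and the adjoint is taken in $L^2(\rho)$ with $\rho$ the stationary distribution, so that $M^\ast = D^{-1} M^\top D$ with $D = \mathrm{diag}(\rho)$. All function names are hypothetical.

```python
import numpy as np

def stationary_distribution(P):
    """Left eigenvector of P for eigenvalue 1, normalized to sum to 1."""
    eigvals, eigvecs = np.linalg.eig(P.T)
    k = np.argmin(np.abs(eigvals - 1.0))
    rho = np.real(eigvecs[:, k])
    return rho / rho.sum()

def rho_adjoint(M, rho):
    """Adjoint of M in L2(rho): D^{-1} M^T D with D = diag(rho) (assumed convention)."""
    return (M.T * rho[None, :]) / rho[:, None]

def top_eigenfunctions(M, rho, k):
    """Top-k eigenpairs of an operator M that is self-adjoint in L2(rho)."""
    s = np.sqrt(rho)
    S = (s[:, None] * M) / s[None, :]      # D^{1/2} M D^{-1/2} is symmetric
    S = 0.5 * (S + S.T)                    # remove round-off asymmetry
    eigvals, U = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1][:k]  # largest eigenvalues first
    V = U[:, order] / s[:, None]           # map back: v = D^{-1/2} u
    return eigvals[order], V

# Toy setup (assumption): dense random chain on n states, gamma close to 1.
rng = np.random.default_rng(0)
n, gamma, k = 6, 0.95, 2
P = rng.random((n, n)) + 0.1
P /= P.sum(axis=1, keepdims=True)
rho = stationary_distribution(P)

Delta = np.eye(n) - gamma * P              # assumed form of the Laplacian
Delta_inv = np.linalg.inv(Delta)

# Candidate features from the abstract: eigenfunctions of Delta^{-1} + (Delta^{-1})^*.
M_inv = Delta_inv + rho_adjoint(Delta_inv, rho)
vals_inv, feats_inv = top_eigenfunctions(M_inv, rho, k)

# Laplacian eigenfunctions for comparison: smallest eigenvalues of Delta + Delta^*.
M_lap = Delta + rho_adjoint(Delta, rho)
neg_vals, feats_lap = top_eigenfunctions(-M_lap, rho, k)
vals_lap = -neg_vals
```

In this toy setting the two feature sets `feats_inv` and `feats_lap` span different subspaces in general, because `P` is not self-adjoint in $L^2(\rho)$ and inverting does not commute with symmetrizing. The constant function is an eigenfunction of both operators here, since `P` preserves constants and $\rho$ is stationary.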

Paper Structure

This paper contains 20 sections, 13 theorems, 83 equations.

Key Result

Theorem 2

Let $r$ be any reward function, and let $Q^{\pi_0}_r$ be the $Q$-function of the reference policy for reward $r$. Let $\hat{Q}$ be any function on $S\times A$, and consider the policy $\pi=\mathrm{Bolt}_{\pi_0}(\hat{Q})$. When $T\to \infty$, the regularized return of policy $\pi$ satisfies

Theorems & Definitions (27)

  • Definition 1: (Regularized return)
  • Theorem 2: (Regularized return of Boltzmann policies)
  • Corollary 3
  • Definition 4: (Regularized successor features)
  • Proposition 5
  • Proposition 6: (Average Bellman gaps do not depend on the features)
  • Definition 7
  • Theorem 8: (Expected regularized return depending on the features)
  • Corollary 9: (Optimal features for regularized successor features)
  • Theorem 10: (Optimal features for $\gamma=1$ in a deterministic environment)
  • ...and 17 more