Proto Successor Measure: Representing the Behavior Space of an RL Agent

Siddhant Agarwal, Harshit Sikchi, Peter Stone, Amy Zhang

TL;DR

Proto Successor Measure (PSM) addresses zero-shot RL by pretraining a reward-free, policy-agnostic basis for the entire behavior space of an MDP. The method learns basis functions $\Phi$ and a bias $b$ such that any policy's successor measure $M^\pi$ can be written as $M^\pi = \sum_i \phi_i w_i^\pi + b$, enabling test-time optimization by solving a constrained linear program over weights $w^\pi$. A discrete codebook of deterministic policies allows turning the bilevel optimization into a single-player objective, while fast inference uses a Lagrangian dual to enforce $\Phi w + b \ge 0$ and recover $Q^*$ and $\pi^*$. Empirically, PSM achieves accurate zero-shot value predictions and near-optimal policies in gridworld, manipulation, and continuous-control benchmarks, often outperforming Laplacian, FB, and HILP baselines. This work provides a principled, scalable representation for transferring knowledge across downstream tasks without additional environment interactions.

Abstract

Having explored an environment, intelligent agents should be able to transfer their knowledge to most downstream tasks within that environment without additional interactions. Referred to as "zero-shot learning", this ability remains elusive for general-purpose reinforcement learning algorithms. While recent works have attempted to produce zero-shot RL agents, they make assumptions about the nature of the tasks or the structure of the MDP. We present Proto Successor Measure: the basis set for all possible behaviors of a Reinforcement Learning Agent in a dynamical system. We prove that any possible behavior (represented using visitation distributions) can be represented using an affine combination of these policy-independent basis functions. Given a reward function at test time, we simply need to find the right set of linear weights to combine these bases corresponding to the optimal policy. We derive a practical algorithm to learn these basis functions using reward-free interaction data from the environment and show that our approach can produce the optimal policy at test time for any given reward function without additional environmental interactions. Project page: https://agarwalsiddhant10.github.io/projects/psm.html.
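
To make the test-time step concrete, here is a minimal sketch on a hypothetical 2-state, 2-action MDP. The transition probabilities, discount factor, and reward are illustrative assumptions, not values from the paper. It builds a bias $b$ and a basis $\Phi$ from the exact visitation distributions of three deterministic policies, then maximizes $r^\top(\Phi w + b)$ subject to $\Phi w + b \ge 0$. Brute-force vertex enumeration stands in here for the Lagrangian-dual solver mentioned in the TL;DR, and tabular visitation distributions stand in for the learned basis.

```python
from itertools import combinations, product

GAMMA = 0.9          # assumed discount factor
MU = [1.0, 0.0]      # initial state distribution
# Hypothetical transitions: P[s][a][s_next]
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.7, 0.3], [0.1, 0.9]]]
SA = [(s, a) for s in range(2) for a in range(2)]  # flat index = 2*s + a


def visitation(policy, iters=500):
    """Discounted visitation d(s, a) of a deterministic policy (policy[s] = action)."""
    d = [0.0] * 4
    for _ in range(iters):  # fixed-point iteration of the Bellman flow equation
        d = [
            (1.0 if policy[s] == a else 0.0)
            * ((1 - GAMMA) * MU[s]
               + GAMMA * sum(P[s2][a2][s] * d[2 * s2 + a2] for s2, a2 in SA))
            for s, a in SA
        ]
    return d


policies = list(product(range(2), repeat=2))  # all 4 deterministic policies
D = {p: visitation(p) for p in policies}

# Reward-free "pretraining": a bias and a 2-column basis spanning the affine set.
b = D[(0, 0)]
PHI = [[D[(0, 1)][k] - b[k], D[(1, 0)][k] - b[k]] for k in range(4)]


def solve_lp(phi, bias, r):
    """Maximize r.(phi w + bias) s.t. phi w + bias >= 0 by enumerating vertices (w in R^2)."""
    best = None
    for i, j in combinations(range(len(bias)), 2):
        det = phi[i][0] * phi[j][1] - phi[i][1] * phi[j][0]
        if abs(det) < 1e-12:
            continue  # constraints i and j are parallel, no vertex
        # Cramer's rule for phi[i].w = -bias[i], phi[j].w = -bias[j]
        w0 = (-bias[i] * phi[j][1] + bias[j] * phi[i][1]) / det
        w1 = (-phi[i][0] * bias[j] + phi[j][0] * bias[i]) / det
        d = [phi[k][0] * w0 + phi[k][1] * w1 + bias[k] for k in range(len(bias))]
        if min(d) < -1e-9:
            continue  # violates non-negativity, not a valid visitation distribution
        value = sum(rk * dk for rk, dk in zip(r, d))
        if best is None or value > best[0]:
            best = (value, (w0, w1), d)
    return best


reward = [0.0, 0.0, 1.0, 1.0]  # reward 1 whenever the agent is in state 1
value, w_star, d_star = solve_lp(PHI, b, reward)
# The greedy policy picks, in each state, the action carrying the visitation mass.
pi_star = tuple(max(range(2), key=lambda a: d_star[2 * s + a]) for s in range(2))
print(value, w_star, pi_star)
```

Only `reward` changes between tasks; `PHI` and `b` are computed once without any reward, which is the property the abstract relies on. Vertex enumeration is viable only because $w$ is two-dimensional in this toy case.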

Paper Structure

This paper contains 35 sections, 7 theorems, 28 equations, 7 figures, 4 tables.

Key Result

Theorem 4.1

All possible state-action visitation distributions in an MDP form an affine set.
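
One way to see why this holds is the following sketch of the standard argument. It is not quoted from the paper's proof, and the notation is assumed here: $d$ is a state-action visitation distribution, $\mu$ the initial state distribution, $\gamma$ the discount factor, and $P$ the transition kernel. The Bellman flow constraint is linear in $d$, so any affine combination of two solutions is again a solution.

```latex
% Bellman flow constraint, for every state s:
\sum_{a} d(s,a) \;=\; (1-\gamma)\,\mu(s) \;+\; \gamma \sum_{s',a'} P(s \mid s',a')\, d(s',a')

% Written as a linear system A d = (1-\gamma)\mu. For solutions d_1, d_2 and any \alpha \in \mathbb{R}:
A\bigl(\alpha d_1 + (1-\alpha) d_2\bigr)
  \;=\; \alpha (1-\gamma)\mu + (1-\alpha)(1-\gamma)\mu
  \;=\; (1-\gamma)\mu

% Hence the solution set is affine and can be parameterized as
d \;=\; \Phi w + b
```

The non-negativity requirement $\Phi w + b \ge 0$ is imposed separately; it selects the valid distributions within this affine set.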

Figures (7)

  • Figure 1: Method Overview: Visitation distributions corresponding to any policy must obey the Bellman Flow constraint for the dynamical system. This means they must lie on the plane defined by the Bellman Flow equation. Being a plane, it can be represented using a basis set $\Phi$ and a bias. All valid (non-negative) visitation distributions lie within a convex hull on this plane. The boundary of this hull is defined by the non-negativity constraints: $\Phi w + b \geq 0$. Each point within this convex hull corresponds to a visitation distribution for a valid policy and is defined simply by the "coordinate" $w$.
  • Figure 1: Table comparing zero-shot RL performance (over 5 seeds) across methods with representation size $d=128$. PSM demonstrates a marked improvement over prior methods. (*) denotes statistical significance under a Mann-Whitney U test at level $0.05$.
  • Figure 2: (left) A Toy MDP with 2 states and 2 actions to depict how the linear program of RL is reduced using precomputation. (right) The corresponding simplex for $w$ assuming the initial state distribution is $\mu = (1, 0)^T$.
  • Figure 3: Qualitative results on a gridworld and four-room: G denotes the goal sampled for every episode. The black regions are the boundaries/obstacles. The agent needs to navigate across the grid and through the small opening (in case of four-room) to reach the goal. We visualize the optimal Q-functions inferred at test time for the given goal in the image. The arrows denote the optimal policy. (Top row) Results for PSM, (Middle Row) Results for FB, (Bottom row) Results for Laplacian Eigenfunctions.
  • Figure 4: Quantitative results on FetchReach: Success rates (averaged over 3 seeds, with the standard deviation shaded) are plotted against the number of training updates for PSM, FB and Laplacian. PSM quickly reaches optimal performance, while FB is unstable in maintaining its optimality. Laplacian remains far from optimal performance.
  • ...and 2 more figures
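
The geometry described in the Figure 1 caption can be checked numerically. The sketch below uses a hypothetical 2-state, 2-action MDP with $\mu = (1, 0)^T$ as in the Figure 2 caption; the transition probabilities and discount factor are assumptions, since the toy MDP's actual parameters are not given here. It shows that an affine combination of two visitation distributions always satisfies the Bellman flow constraint, but remains non-negative (and hence valid) only inside the hull.

```python
GAMMA = 0.9          # assumed discount factor
MU = [1.0, 0.0]      # initial state distribution from the Figure 2 caption
# Hypothetical transitions: P[s][a][s_next]
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.7, 0.3], [0.1, 0.9]]]
SA = [(s, a) for s in range(2) for a in range(2)]  # flat index = 2*s + a


def inflow(d, s):
    """Right-hand side of the Bellman flow constraint at state s."""
    return (1 - GAMMA) * MU[s] + GAMMA * sum(
        P[s2][a2][s] * d[2 * s2 + a2] for s2, a2 in SA)


def visitation(policy, iters=500):
    """Discounted visitation d(s, a); policy[s][a] is the probability of a in s."""
    d = [0.0] * 4
    for _ in range(iters):  # fixed-point iteration, contracts at rate GAMMA
        d = [policy[s][a] * inflow(d, s) for s, a in SA]
    return d


def flow_residual(d):
    """Largest violation of sum_a d(s, a) = inflow(d, s) over states."""
    return max(abs(d[2 * s] + d[2 * s + 1] - inflow(d, s)) for s in range(2))


def affine(d1, d2, alpha):
    """Affine combination alpha * d1 + (1 - alpha) * d2."""
    return [alpha * x + (1 - alpha) * y for x, y in zip(d1, d2)]


d_a = visitation([[1, 0], [1, 0]])  # always take action 0
d_b = visitation([[0, 1], [0, 1]])  # always take action 1

inside = affine(d_a, d_b, 0.5)      # convex combination: on the plane and valid
outside = affine(d_a, d_b, 1.5)     # on the plane, but has negative entries

print(flow_residual(inside), min(inside))
print(flow_residual(outside), min(outside))
```

Both points lie on the plane (the flow residual is zero up to floating-point error), but only the first satisfies the non-negativity constraints that bound the hull.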

Theorems & Definitions (12)

  • Theorem 4.1
  • Corollary 4.2
  • Corollary 4.3
  • Theorem 4.4
  • Theorem 6.1
  • Lemma 6.2
  • Theorem 6.3
  • proof
  • proof
  • proof
  • ...and 2 more