
$\pi2\text{vec}$: Policy Representations with Successor Features

Gianluca Scarpellini, Ksenia Konyushkova, Claudio Fantacci, Tom Le Paine, Yutian Chen, Misha Denil

TL;DR

π2vec addresses the costly process of policy evaluation in robotics by learning offline, task-agnostic representations of black-box policies. It constructs policy embeddings Ψ_π^φ through a three-step pipeline: a policy-agnostic encoder φ, a policy-specific successor-feature encoder ψ_π^φ learned via offline fitted Q evaluation (FQE), and an aggregation over canonical states. A supervised performance predictor is then trained on the resulting embeddings. The approach demonstrates superior offline policy ranking and selection across multiple real and simulated domains, and it highlights the importance of choosing an appropriate foundation-model encoder φ. By enabling fully offline policy selection and leveraging diverse foundation models, π2vec offers a scalable, data-efficient tool for policy evaluation in resource-constrained robotic settings.
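
The aggregation and prediction steps of this pipeline can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the aggregation is a plain element-wise mean over canonical states, substitutes a 1-nearest-neighbour lookup for the supervised performance predictor, and uses hypothetical names (`policy_embedding`, `predict_return`) and made-up toy numbers.

```python
import math


def policy_embedding(successor_features):
    """Aggregate per-state successor features into one policy vector.

    successor_features: list of vectors psi(s), one per canonical state.
    Assumption: the aggregation is an element-wise mean.
    """
    n_states = len(successor_features)
    dim = len(successor_features[0])
    return [sum(psi[d] for psi in successor_features) / n_states
            for d in range(dim)]


def predict_return(query, known_embeddings, known_returns):
    """Hypothetical stand-in predictor: return of the closest known policy."""
    distances = [math.dist(query, emb) for emb in known_embeddings]
    return known_returns[distances.index(min(distances))]


# Toy data (invented for illustration): two canonical states, 2-D features.
psi_a = [[1.0, 0.0], [3.0, 2.0]]
psi_b = [[0.0, 4.0], [0.0, 6.0]]
emb_a = policy_embedding(psi_a)
emb_b = policy_embedding(psi_b)

# A new policy whose embedding lies near policy A inherits A's measured return.
new_emb = policy_embedding([[2.0, 1.0], [2.0, 2.0]])
estimate = predict_return(new_emb, [emb_a, emb_b], [0.9, 0.2])
```

The point of the sketch is the data flow: once every policy is a fixed-length vector, any standard regressor can map embeddings to performance using only policies whose returns are already known.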

Abstract

This paper describes $\pi2\text{vec}$, a method for representing the behaviors of black-box policies as feature vectors. The policy representations capture, in a task-agnostic way, how the statistics of foundation model features change in response to the policy's behavior. They can be trained from offline data, which allows them to be used for offline policy selection. This work provides a key piece of a recipe for fusing together three modern lines of research: offline policy evaluation as a counterpart to offline RL, foundation models as generic and powerful state representations, and efficient policy selection in resource-constrained environments.

Paper Structure (38 sections, 5 equations, 4 figures, 10 tables)

This paper contains 38 sections, 5 equations, 4 figures, 10 tables.

Figures (4)

  • Figure 1: The $\pi2\text{vec}$ method relies on the successor feature framework, which we adopt in combination with a dataset of offline demonstrations and a visual foundation model $\phi$. $\pi2\text{vec}$ represents each policy $\pi_i$ as a feature vector $\Psi_{\pi_i}^\phi \in \mathbb{R}^n$. $\Psi_{\pi_i}^\phi$ encodes the expected behavior of a policy when deployed on an agent.
  • Figure 2: Given a trajectory from the dataset of offline demonstrations, we train successor feature $\psi_\pi^\phi(s_t)$ to predict the discounted sum of features $\sum_i \gamma^i \phi(s_{t+i})$, where $\phi$ is a visual feature extractor and $\pi$ is a policy. Intuitively, $\phi(s_i)$ represents semantic changes in the current state of the environment $s_i$, while successor feature $\psi_\pi^\phi(s_t)$ summarizes all future features encoded by $\phi$ if actions came from policy $\pi$.
  • Figure 3: We adopt 5 environments. (i) Kitchen: 5 tasks (knob on, left door open, light on, microwave open, and right door open) and 3 points of view. (ii) Metaworld: 4 tasks (assembly, button press, bin picking, and drawer open) and 3 points of view. (iii) Insert gear in simulation and (iv) on a real robot. (v) RGB stacking on a real robot.
  • Figure 4: We implement $\psi^\phi_\pi$ as a neural network. First, we encode the state $s_t$ (consisting of observations and proprioception) and the policy actions $\pi(s_t)$ into feature vectors. Next, we concatenate the features and input the resulting vector to a multi-layer perceptron. $\psi^\phi_\pi$ outputs a vector of $B\times N$ dimensions, where $B$ is the number of bins of the distribution and $N$ is the dimension of the feature vector $\phi(s_t)$. We reshape the output into a matrix, where each row $i$ represents a histogram of probabilities of size $B$ for the successor feature $\psi_i$.
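
The training target from Figure 2 and the output layout from Figure 4 can be made concrete with a small sketch. It is illustrative only: the function names, the softmax normalisation of each row, the evenly spaced bin centres, and the toy inputs are assumptions rather than details confirmed by the captions.

```python
import math


def discounted_feature_sum(features, gamma):
    """Targets for psi(s_t): sum_i gamma^i * phi(s_{t+i}), for every t.

    features: list of phi(s_t) vectors along one trajectory.
    Computed backwards with the recursion psi_t = phi_t + gamma * psi_{t+1}.
    """
    dim = len(features[0])
    running = [0.0] * dim
    targets = []
    for phi in reversed(features):
        running = [p + gamma * r for p, r in zip(phi, running)]
        targets.append(running)
    targets.reverse()
    return targets


def decode_histograms(flat_output, n_features, n_bins, bin_centres):
    """Split a flat B*N output into N rows of B logits and take expectations.

    Assumption: each row is normalised with a softmax, and the scalar value
    of successor feature i is the expectation over the bin centres.
    """
    values = []
    for i in range(n_features):
        logits = flat_output[i * n_bins:(i + 1) * n_bins]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(x - peak) for x in logits]
        total = sum(exps)
        probs = [e / total for e in exps]
        values.append(sum(p * c for p, c in zip(probs, bin_centres)))
    return values


# Toy trajectory with 1-D features: the first target is 1 + 0.5*2 + 0.25*4.
targets = discounted_feature_sum([[1.0], [2.0], [4.0]], gamma=0.5)

# N = 2 features, B = 3 bins; all-zero logits give uniform histograms.
decoded = decode_histograms([0.0] * 6, n_features=2, n_bins=3,
                            bin_centres=[-1.0, 0.0, 1.0])
```

In a real setup the targets would be bootstrapped from the network's own predictions at the next state rather than summed over a full trajectory, and the per-row histograms would be trained with a distributional loss; the sketch only shows the quantities involved and their shapes.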