
Distributional Successor Features Enable Zero-Shot Policy Optimization

Chuning Zhu, Xinqi Wang, Tyler Han, Simon S. Du, Abhishek Gupta

TL;DR

DiSPOs are a class of transferable world models that learn, from offline data, the distribution of successor features $p(\psi|s)$ together with a readout policy $\pi(a|s,\psi)$ that acts to realize a chosen outcome. Zero-shot policy optimization then reduces to two steps: a linear regression that infers reward weights, and a search for the best $\psi$ within the support of $p(\psi|s)$, with planning carried out via diffusion-based guidance to avoid autoregressive rollout errors. Theoretical analysis provides error and suboptimality bounds, while empirical evaluation across long-horizon robotic domains shows strong cross-task transfer, the ability to handle arbitrary reward functions, and trajectory stitching without test-time policy optimization. This framework offers a practical, scalable approach to offline multitask RL, enabling rapid adaptation with minimal online computation and robust performance across diverse reward structures.

Abstract

Intelligent agents must be generalists, capable of quickly adapting to various tasks. In reinforcement learning (RL), model-based RL learns a dynamics model of the world, in principle enabling transfer to arbitrary reward functions through planning. However, autoregressive model rollouts suffer from compounding error, making model-based RL ineffective for long-horizon problems. Successor features offer an alternative by modeling a policy's long-term state occupancy, reducing policy evaluation under new rewards to linear regression. Yet, zero-shot policy optimization for new tasks with successor features can be challenging. This work proposes a novel class of models, i.e., Distributional Successor Features for Zero-Shot Policy Optimization (DiSPOs), that learn a distribution of successor features of a stationary dataset's behavior policy, along with a policy that acts to realize different successor features achievable within the dataset. By directly modeling long-term outcomes in the dataset, DiSPOs avoid compounding error while enabling a simple scheme for zero-shot policy optimization across reward functions. We present a practical instantiation of DiSPOs using diffusion models and show their efficacy as a new class of transferable models, both theoretically and empirically across various simulated robotics problems. Videos and code available at https://weirdlabuw.github.io/dispo/.
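
The claim that successor features reduce policy evaluation under a new reward to linear regression can be illustrated with a minimal NumPy sketch. It assumes rewards are exactly linear in the state features, $r(s) = w^\top \phi(s)$, and uses a single synthetic trajectory; the helper names `successor_features` and `fit_reward_weights`, the feature dimension, and the discount factor are illustrative choices, not taken from the paper's code.

```python
import numpy as np

def successor_features(phis, gamma):
    """Discounted sum of per-step features: psi = sum_t gamma^t * phi(s_t)."""
    discounts = gamma ** np.arange(len(phis))
    return discounts @ phis

def fit_reward_weights(phis, rewards):
    """Least-squares fit of w such that rewards ~= phis @ w."""
    w, *_ = np.linalg.lstsq(phis, rewards, rcond=None)
    return w

rng = np.random.default_rng(0)
gamma = 0.9                                # assumed discount factor
phis = rng.normal(size=(50, 4))            # hypothetical features phi(s_t) along one trajectory
true_w = np.array([1.0, -2.0, 0.5, 0.0])   # hypothetical reward weights for a new task
rewards = phis @ true_w                    # rewards that are exactly linear in the features

w_hat = fit_reward_weights(phis, rewards)  # reward regression on labeled states
psi = successor_features(phis, gamma)      # long-term outcome of the trajectory
value_from_psi = w_hat @ psi               # value under the new reward, no rollout needed
discounted_return = (gamma ** np.arange(len(rewards))) @ rewards
```

Under these assumptions `value_from_psi` matches `discounted_return` up to floating-point error, so evaluating a new reward only requires refitting `w_hat`, not re-estimating the outcome `psi`.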

Paper Structure

This paper contains 40 sections, 3 theorems, 9 equations, 7 figures, 9 tables, and 5 algorithms.

Key Result

Theorem 5.3

For any MDP $\mathcal{M}$ and $\epsilon$-good outcome distribution $\hat{p}$, the policy $\widehat{\pi}$ given by the random shooting planner with sampling optimality $\tau$ is a $(\epsilon + \tau, \pi_\beta)$-good policy.
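
The random shooting planner named in the theorem can be pictured as: draw candidate outcomes from the learned distribution $\hat{p}(\psi|s)$, score each candidate by the inferred reward weights, and keep the best. The sketch below is a simplified illustration under stated assumptions, not the paper's implementation: `toy_sampler` is a hypothetical stand-in for a learned outcome model, `w` is an assumed reward-weight vector, and the readout policy that would turn the selected outcome into an action is omitted.

```python
import numpy as np

def random_shooting_plan(sample_outcomes, w, state, num_samples, rng):
    """Sample candidate successor features for `state` and keep the highest-scoring one."""
    candidates = sample_outcomes(state, num_samples, rng)  # shape (num_samples, d)
    scores = candidates @ w                                # estimated value of each outcome
    best = int(np.argmax(scores))
    return candidates[best], float(scores[best])

def toy_sampler(state, num_samples, rng):
    # Hypothetical outcome model: two outcome modes plus small noise.
    # The state is ignored here; a learned model would condition on it.
    modes = np.array([[1.0, 0.0], [0.0, 1.0]])
    idx = rng.integers(0, len(modes), size=num_samples)
    return modes[idx] + 0.01 * rng.normal(size=(num_samples, 2))

rng = np.random.default_rng(0)
w = np.array([0.0, 1.0])  # assumed reward weights that favor the second feature
psi_star, value = random_shooting_plan(toy_sampler, w, state=None, num_samples=64, rng=rng)
```

Because candidates are drawn from the outcome model itself, the selected `psi_star` stays among outcomes the model considers achievable. Drawing more samples makes it more likely that the best available outcome is among the candidates, which is the intuition behind a sampling-optimality term such as $\tau$.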

Figures (7)

  • Figure 1: The transfer setting for DiSPOs. Given an unlabeled offline dataset, DiSPOs model both "what can happen?" $p(\psi|s)$ and "how can we achieve a particular outcome?" $p(a|s, \psi)$. This is used for quick adaptation to new downstream tasks without test-time policy optimization.
  • Figure 2: DiSPOs for a simple environment. Given a state feature function $\phi$, DiSPOs learn a distribution of all possible long-term outcomes (successor features $\psi$) in the dataset $p(\psi|s)$, along with a readout policy $\pi(a|s, \psi)$ that takes an action $a$ to realize $\psi$ starting at state $s$.
  • Figure 3: Zero-shot policy optimization with DiSPOs. Once a DiSPO is learned, the optimal action can be obtained by performing reward regression and searching for the optimal outcome under the dynamics to decode via the policy.
  • Figure 4: Evaluation domains: (1) D4RL Antmaze (Fu et al., 2020), (2) Franka Kitchen (Fu et al., 2020), (3) Hopper (Chen et al., 2023), (4) Preference-Based Antmaze, where the goal is to take a particular path, and (5) Roboverse (Singh et al., 2020) robotic manipulation.
  • Figure 5: Transfer across tasks with DiSPOs and COMBO (Yu et al., 2021) in medium antmaze. Each tile corresponds to a different task, with the color of the tile indicating the normalized return. DiSPOs successfully transfer across a majority of tasks, while MBRL (Yu et al., 2021) struggles on tasks that are further away from the initial location.
  • ...and 2 more figures

Theorems & Definitions (7)

  • Definition 5.2
  • Theorem 5.3: main theorem
  • Corollary 5.4
  • Theorem 5.5
  • proof
  • proof
  • proof