Distributional Successor Features Enable Zero-Shot Policy Optimization
Chuning Zhu, Xinqi Wang, Tyler Han, Simon S. Du, Abhishek Gupta
TL;DR
DiSPOs are a class of transferable models that learn, from offline data, the distribution of successor features $p(\psi|s)$ achievable under the dataset's behavior policy, together with a readout policy $\pi(a|s,\psi)$ that acts to realize a chosen outcome. Zero-shot policy optimization for a new reward then reduces to two steps: a linear regression to infer the reward weights, and a search for the highest-value $\psi$ that stays within the support of $p(\psi|s)$. Planning is carried out via diffusion-based guidance, which avoids the compounding error of autoregressive rollouts. Theoretical analysis provides error and suboptimality bounds, and experiments on long-horizon simulated robotics domains show cross-task transfer, handling of arbitrary reward functions, and trajectory stitching, all without test-time policy optimization. The result is an approach to offline multitask RL that adapts to new rewards with little test-time computation.
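The two test-time steps described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names, the ridge term, and the synthetic data are all hypothetical, and a small fixed array of candidate outcomes stands in for samples from a learned $p(\psi|s)$, which the paper obtains with a guided diffusion model.

```python
import numpy as np

def infer_reward_weights(features, rewards, ridge=1e-6):
    """Least-squares fit of w in r ~= phi(s)^T w.

    The small ridge term is an assumption added for numerical stability.
    """
    d = features.shape[1]
    gram = features.T @ features + ridge * np.eye(d)
    return np.linalg.solve(gram, features.T @ rewards)

def select_best_outcome(candidate_psis, weights):
    """Pick the candidate successor feature with the highest value w^T psi.

    Restricting the search to the given candidates is a stand-in for
    staying within the support of p(psi | s).
    """
    values = candidate_psis @ weights
    best = int(np.argmax(values))
    return candidate_psis[best], float(values[best])

# Synthetic data: rewards are exactly linear in 3-dimensional features.
rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0, 0.5])
phi = rng.normal(size=(200, 3))
r = phi @ true_w

# Step 1: infer reward weights by linear regression.
w_hat = infer_reward_weights(phi, r)

# Step 2: choose the best outcome among hypothetical in-support candidates.
candidates = np.array([
    [1.0, 0.0, 0.0],
    [0.0, -1.0, 0.0],
    [0.0, 0.0, 1.0],
])
best_psi, best_value = select_best_outcome(candidates, w_hat)
```

The chosen `best_psi` would then be passed to the readout policy $\pi(a|s,\psi)$ to pick an action. No policy is retrained for the new reward in this sketch; only `w_hat` changes.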
Abstract
Intelligent agents must be generalists, capable of quickly adapting to various tasks. In reinforcement learning (RL), model-based RL learns a dynamics model of the world, in principle enabling transfer to arbitrary reward functions through planning. However, autoregressive model rollouts suffer from compounding error, making model-based RL ineffective for long-horizon problems. Successor features offer an alternative by modeling a policy's long-term state occupancy, reducing policy evaluation under new rewards to linear regression. Yet, zero-shot policy optimization for new tasks with successor features can be challenging. This work proposes a novel class of models, Distributional Successor Features for Zero-Shot Policy Optimization (DiSPOs), that learn a distribution of successor features of a stationary dataset's behavior policy, along with a policy that acts to realize different successor features achievable within the dataset. By directly modeling long-term outcomes in the dataset, DiSPOs avoid compounding error while enabling a simple scheme for zero-shot policy optimization across reward functions. We present a practical instantiation of DiSPOs using diffusion models and show their efficacy as a new class of transferable models, both theoretically and empirically across various simulated robotics problems. Videos and code are available at https://weirdlabuw.github.io/dispo/.
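As a brief illustration of why successor features reduce policy evaluation to linear regression, the standard identity can be written out as follows. The feature map $\phi$, reward weights $w$, and discount factor $\gamma$ are symbols assumed here for the sketch, and the time-indexing convention may differ from the paper's.

$$
\psi^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\,\phi(s_t) \,\middle|\, s_0 = s\right],
\qquad
r(s) = w^{\top}\phi(s)
\;\Longrightarrow\;
V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t) \,\middle|\, s_0 = s\right] = w^{\top}\psi^{\pi}(s).
$$

Once $w$ is fit for a new reward, the value of any outcome $\psi$ follows from a single inner product. This is what makes it cheap to compare outcomes across reward functions.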
