Unsupervised Zero-Shot Reinforcement Learning via Functional Reward Encodings

Kevin Frans; Seohong Park; Pieter Abbeel; Sergey Levine

TL;DR

The paper introduces Functional Reward Encoding (FRE), a framework to pretrain a generalist, zero-shot RL agent from unlabeled offline trajectories by learning a latent encoding of arbitrary reward functions. A transformer-based variational encoder maps samples of (state, reward) pairs into a latent z, enabling a decoder to predict rewards and a downstream policy to maximize rewards conditioned on z. By training on a diverse, domain-agnostic prior of random rewards and using an offline RL objective, FRE achieves competitive results on standard offline RL benchmarks and demonstrates robust zero-shot transfer to unseen tasks with minimal reward information. This approach offers a scalable path to generalist agents that can rapidly adapt to new objectives without task-specific labels or online fine-tuning, with practical impact in robotics and beyond.

Abstract

Can we pre-train a generalist agent from a large amount of unlabeled offline trajectories such that it can be immediately adapted to any new downstream task in a zero-shot manner? In this work, we present functional reward encoding (FRE) as a general, scalable solution to this zero-shot RL problem. Our main idea is to learn functional representations of arbitrary tasks by encoding their state-reward samples using a transformer-based variational auto-encoder. This functional encoding not only enables the pre-training of an agent from a wide diversity of general unsupervised reward functions, but also provides a way to solve any new downstream task in a zero-shot manner, given a small number of reward-annotated samples. We empirically show that FRE agents trained on diverse random unsupervised reward functions can generalize to solve novel tasks in a range of simulated robotic benchmarks, often outperforming previous zero-shot RL and offline RL methods. Code for this project is provided at: https://github.com/kvfrans/fre
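The zero-shot step described in the abstract (turning a small number of reward-annotated samples into a task encoding) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `annotate`, `encode`, and `policy_input` are hypothetical, and a simple mean-pooling encoder stands in for the transformer-based variational auto-encoder. It only shows the property that matters here, namely that the encoding depends on the set of (state, reward) pairs and not on their order.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple

State = Tuple[float, ...]


def annotate(states: Sequence[State],
             reward_fn: Callable[[State], float]) -> List[Tuple[State, float]]:
    """Pair each state with its reward, giving the (s, eta(s)) samples."""
    return [(s, reward_fn(s)) for s in states]


def encode(pairs: Sequence[Tuple[State, float]]) -> List[float]:
    """Toy permutation-invariant encoder (assumption: mean pooling replaces
    the transformer VAE). Each pair is embedded as [r, r*s_1, ..., r*s_d];
    the embeddings are averaged with math.fsum, whose result does not depend
    on summation order."""
    dim = len(pairs[0][0])
    columns: List[List[float]] = [[] for _ in range(dim + 1)]
    for s, r in pairs:
        columns[0].append(r)
        for i, x in enumerate(s):
            columns[i + 1].append(r * x)
    return [math.fsum(col) / len(pairs) for col in columns]


def policy_input(state: State, z: Sequence[float]) -> List[float]:
    """Hypothetical conditioning scheme: concatenate the state with z."""
    return list(state) + list(z)


# Usage: a downstream task given only as rewards on a few sampled states.
rng = random.Random(0)
sampled_states = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(32)]
goal = (0.5, -0.25)  # illustrative task: negative distance to a goal
z = encode(annotate(sampled_states, lambda s: -math.dist(s, goal)))
features = policy_input(sampled_states[0], z)  # what a z-conditioned policy would see
```

In this sketch no gradient step or fine-tuning happens at evaluation time: the task is specified entirely by the latent computed from the annotated samples.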

Paper Structure (20 sections, 6 equations, 9 figures, 4 tables, 1 algorithm)

Introduction
Related Work
Preliminaries and Problem Setting
Unsupervised Zero-Shot RL via Functional Reward Encodings
Functional Reward Encoding
Random Functions as a Prior Reward Distribution
Offline RL with FRE
Experiments
Do FRE encodings trained on random reward functions zero-shot transfer to unseen test tasks?
How does FRE perform on zero-shot offline RL benchmarks, compared to prior methods?
What are the scaling properties of FRE as the space of random rewards increases?
Can prior domain knowledge be used to increase the specificity of the FRE encoding?
Discussion
Hyperparameters
Training Details
...and 5 more sections

Figures (9)

Figure 1: FRE discovers latent representations over random unsupervised reward functions. At evaluation, user-given downstream objectives can be encoded into the latent space to enable zero-shot policy execution. FRE utilizes simple building blocks and is a data-scalable way to learn general capabilities from unlabeled offline trajectory data.
Figure 2: FRE encodes a reward function by evaluating its output over a random set of data states. Given a sampled reward function $\eta$, the reward function is first evaluated on a set of random encoder states from the offline dataset. The $(s,\eta(s))$ pairs are then passed into a permutation-invariant transformer encoder, which produces a latent task embedding $z$. A decoder head is then optimized to minimize the mean-squared error between the true reward and the predicted reward on a set of decoder states. The encoder-decoder structure is trained jointly, and $z$ can be utilized for downstream learning of task-conditioned policies and value functions.
Figure 3: After unsupervised pretraining, FRE can solve user-specified downstream tasks without additional fine-tuning. Shown above are examples of reward functions sampled from various evaluations in AntMaze. Columns: 1) True reward function projected onto maze. 2) Random states used for encoding shown in non-black. 3) Reward predicted by decoder network. 4) Behavior of FRE policy conditioned on latent encoding. Agents start at the red dot. 5) Visualization of predicted value function.
Figure 4: Evaluation domains: AntMaze, ExORL, and Kitchen.
Figure 5: The general capabilities of a FRE agent scale with the diversity of random functions used in training. FRE-all represents an agent trained on a uniform mixture of three random reward families, while each other column represents a specific agent trained on only a subset of the three. The robust FRE-all agent displays the largest total score and competitive performance across all evaluation tasks, showing that the FRE encoding can combine reward function distributions without losing performance.
...and 4 more figures
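
The training signal described in the Figure 2 caption can be written out as a rough objective. This is a sketch under stated assumptions, not the paper's exact equation: $p_\theta$ (encoder), $d_\phi$ (decoder), the encoder states $s^e_{1:K}$, the decoder states $s^d_{1:K'}$, and the weight $\beta$ are notation introduced here, and the KL term toward a standard normal prior is the usual variational auto-encoder regularizer, assumed rather than taken from the caption.

$$
\mathcal{L}(\theta, \phi) = \mathbb{E}_{\eta,\, s^e_{1:K},\, s^d_{1:K'}} \left[ \mathbb{E}_{z \sim p_\theta\left(z \,\mid\, \{(s^e_i, \eta(s^e_i))\}_{i=1}^{K}\right)} \left[ \frac{1}{K'} \sum_{j=1}^{K'} \left( d_\phi(s^d_j, z) - \eta(s^d_j) \right)^2 \right] + \beta \, D_{\mathrm{KL}}\!\left( p_\theta\left(z \,\mid\, \{(s^e_i, \eta(s^e_i))\}_{i=1}^{K}\right) \,\Big\|\, \mathcal{N}(0, I) \right) \right]
$$

The first term is the mean-squared error between the true and predicted rewards on the decoder states. The second term limits how much information about $\eta$ the latent $z$ can carry.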
