Unsupervised Zero-Shot Reinforcement Learning via Functional Reward Encodings

Kevin Frans, Seohong Park, Pieter Abbeel, Sergey Levine

TL;DR

The paper introduces Functional Reward Encoding (FRE), a framework for pretraining a generalist, zero-shot RL agent from unlabeled offline trajectories by learning a latent encoding of arbitrary reward functions. A transformer-based variational encoder maps a set of (state, reward) samples to a latent z. A decoder predicts rewards from z, and a downstream policy is trained with an offline RL objective to maximize the reward encoded by z. Because pretraining uses a diverse, domain-agnostic prior of random reward functions, FRE performs competitively with, and often outperforms, prior zero-shot RL and offline RL methods on simulated robotic benchmarks, and it transfers zero-shot to unseen tasks given only a small number of reward-annotated samples. The approach offers a scalable path to generalist agents that adapt to new objectives without task-specific labels during pretraining and without online fine-tuning at test time, with potential applications in robotics.
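
The "prior of random rewards" can be illustrated with a small sampler. This is a hypothetical sketch: the summary does not name the reward families, so the three used here (goal-reaching, random linear, random MLP), the success radius, and all function names are assumptions rather than the authors' code. Each sampled function maps a batch of states to a batch of scalar rewards, which is all the encoder needs.

```python
import numpy as np

# Hypothetical prior over random reward functions. The family choices and
# parameterizations below are illustrative assumptions, not the FRE code.

def sample_goal_reward(states, rng, eps=0.5):
    """Goal-reaching reward: 0 within eps of a goal drawn from the data, else -1."""
    goal = states[rng.integers(len(states))]
    def eta(s):
        dist = np.linalg.norm(s - goal, axis=-1)
        return np.where(dist < eps, 0.0, -1.0)
    return eta

def sample_linear_reward(dim, rng):
    """Random linear reward: r(s) = w . s with Gaussian weights."""
    w = rng.normal(size=dim)
    return lambda s: s @ w

def sample_mlp_reward(dim, rng, hidden=32):
    """Random one-hidden-layer tanh MLP reward."""
    w1 = rng.normal(size=(dim, hidden)) / np.sqrt(dim)
    w2 = rng.normal(size=hidden) / np.sqrt(hidden)
    return lambda s: np.tanh(s @ w1) @ w2

def sample_reward_fn(states, rng):
    """Draw one reward function from a uniform mixture of the three families."""
    family = rng.integers(3)
    dim = states.shape[-1]
    if family == 0:
        return sample_goal_reward(states, rng)
    if family == 1:
        return sample_linear_reward(dim, rng)
    return sample_mlp_reward(dim, rng)
```

Usage: with `states` of shape `(N, d)` taken from the offline dataset, `eta = sample_reward_fn(states, np.random.default_rng(0))` returns a function, and `eta(states)` has shape `(N,)`. Mixing families is one simple way to widen the prior; Figure 5 reports that a mixture of three families gives the best overall score.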

Abstract

Can we pre-train a generalist agent from a large amount of unlabeled offline trajectories such that it can be immediately adapted to any new downstream task in a zero-shot manner? In this work, we present a functional reward encoding (FRE) as a general, scalable solution to this zero-shot RL problem. Our main idea is to learn functional representations of arbitrary tasks by encoding their state-reward samples using a transformer-based variational auto-encoder. This functional encoding not only enables the pre-training of an agent from a wide diversity of general unsupervised reward functions, but also provides a way to solve any new downstream task in a zero-shot manner, given a small number of reward-annotated samples. We empirically show that FRE agents trained on diverse random unsupervised reward functions can generalize to solve novel tasks in a range of simulated robotic benchmarks, often outperforming previous zero-shot RL and offline RL methods. Code for this project is provided at: https://github.com/kvfrans/fre
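
Zero-shot adaptation, as described in the abstract, needs no gradient steps at test time. A few reward-annotated states are encoded once into a latent z, and the pretrained policy is then conditioned on that z. The sketch below uses random weights as stand-ins for pretrained networks and mean pooling in place of the transformer encoder. All names, dimensions, and the example downstream reward are hypothetical. A variational encoder would typically be queried at its posterior mean here, whereas this stand-in is deterministic.

```python
import numpy as np

# Minimal sketch of zero-shot task adaptation with a latent-conditioned policy.
# The "pretrained" weights are random stand-ins; names and shapes are hypothetical.

rng = np.random.default_rng(0)
STATE_DIM, LATENT_DIM, ACTION_DIM = 4, 8, 2

W_enc = rng.normal(size=(STATE_DIM + 1, LATENT_DIM))
W_pi = rng.normal(size=(STATE_DIM + LATENT_DIM, ACTION_DIM))

def encode_task(states, rewards):
    """Map K (state, reward) pairs to one latent z.
    Mean pooling keeps z independent of sample order; it does not
    reproduce FRE's transformer encoder."""
    pairs = np.concatenate([states, rewards[:, None]], axis=-1)  # (K, STATE_DIM + 1)
    return np.tanh(pairs @ W_enc).mean(axis=0)                   # (LATENT_DIM,)

def policy(state, z):
    """Deterministic latent-conditioned policy pi(a | s, z), actions in [-1, 1]."""
    return np.tanh(np.concatenate([state, z]) @ W_pi)

def downstream_reward(s):
    """Hypothetical user-specified task: stay close to the origin."""
    return -np.linalg.norm(s, axis=-1)

# A small set of reward-annotated states is all the agent receives.
encoder_states = rng.normal(size=(32, STATE_DIM))
z = encode_task(encoder_states, downstream_reward(encoder_states))

# No fine-tuning: z is fixed and the policy is queried directly.
action = policy(np.zeros(STATE_DIM), z)
```

The design choice this highlights is that the task enters the policy only through z, so switching to a new objective means re-encoding a handful of labeled samples rather than retraining.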

Paper Structure (20 sections, 6 equations, 9 figures, 4 tables, 1 algorithm)

Figures (9)

  • Figure 1: FRE discovers latent representations over random unsupervised reward functions. At evaluation, user-given downstream objectives can be encoded into the latent space to enable zero-shot policy execution. FRE utilizes simple building blocks and is a data-scalable way to learn general capabilities from unlabeled offline trajectory data.
  • Figure 2: FRE encodes a reward function by evaluating its output over a random set of data states. Given a sampled reward function $\eta$, the reward function is first evaluated on a set of random encoder states from the offline dataset. The $(s,\eta(s))$ pairs are then passed into a permutation-invariant transformer encoder, which produces a latent task embedding $z$. A decoder head is then optimized to minimize the mean-squared error between the true reward and the predicted reward on a set of decoder states. The encoder-decoder structure is trained jointly, and $z$ can be utilized for downstream learning of task-conditioned policies and value functions.
  • Figure 3: After unsupervised pretraining, FRE can solve user-specified downstream tasks without additional fine-tuning. Shown above are examples of reward functions sampled from various evaluations in AntMaze. Columns: 1) True reward function projected onto maze. 2) Random states used for encoding shown in non-black. 3) Reward predicted by decoder network. 4) Behavior of FRE policy conditioned on latent encoding. Agents start at the red dot. 5) Visualization of predicted value function.
  • Figure 4: Evaluation domains: AntMaze, ExORL, and Kitchen.
  • Figure 5: The general capabilities of a FRE agent scale with the diversity of the random functions used in training. FRE-all is an agent trained on a uniform mixture of three random reward families, while each other column is an agent trained on only a subset of the three. The FRE-all agent achieves the highest total score and performs competitively across all evaluation tasks, showing that the FRE encoding can combine reward function distributions without losing performance.
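
The encoder-decoder training signal described in the Figure 2 caption can be sketched as a single loss evaluation. Several details are assumptions not stated in the captions: the encoder outputs a diagonal Gaussian with a standard normal prior (the usual choice for a variational auto-encoder), mean pooling over pairs stands in for the permutation-invariant transformer, the KL weight `beta` is an arbitrary 0.01, and each network is a single hidden layer with random weights. Only the forward loss is shown; in practice the parameters would be updated by gradient descent on this quantity.

```python
import numpy as np

# Sketch of one FRE-style loss evaluation following the Figure 2 description.
# Assumed: mean pooling instead of a transformer, diagonal Gaussian posterior,
# N(0, I) prior, fixed beta, random single-hidden-layer networks.

rng = np.random.default_rng(0)
S, D, H = 4, 8, 16  # state dim, latent dim, hidden width (assumed)

params = {
    "enc": rng.normal(size=(S + 1, H)) / np.sqrt(S + 1),
    "mu": rng.normal(size=(H, D)) / np.sqrt(H),
    "logvar": 0.1 * rng.normal(size=(H, D)) / np.sqrt(H),
    "dec": rng.normal(size=(S + D, H)) / np.sqrt(S + D),
    "out": rng.normal(size=H) / np.sqrt(H),
}

def encode(p, states, rewards):
    """q(z | {(s, eta(s))}): diagonal Gaussian from order-invariant pooled features."""
    pairs = np.concatenate([states, rewards[:, None]], axis=-1)
    h = np.tanh(pairs @ p["enc"]).mean(axis=0)
    return h @ p["mu"], h @ p["logvar"]

def decode(p, states, z):
    """Predict eta(s) at the decoder states, given the latent z."""
    zs = np.broadcast_to(z, (len(states), len(z)))
    return np.tanh(np.concatenate([states, zs], axis=-1) @ p["dec"]) @ p["out"]

def fre_loss(p, eta, enc_states, dec_states, rng, beta=0.01):
    """Decoder MSE on decoder states plus beta-weighted KL to the prior."""
    mu, logvar = encode(p, enc_states, eta(enc_states))
    z = mu + np.exp(0.5 * logvar) * rng.normal(size=mu.shape)  # reparameterized sample
    mse = np.mean((eta(dec_states) - decode(p, dec_states, z)) ** 2)
    kl = 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar)  # KL(q || N(0, I))
    return mse + beta * kl

data = rng.normal(size=(1000, S))  # stand-in for offline dataset states
w = rng.normal(size=S)
eta = lambda s: s @ w              # one sampled reward function (linear, for brevity)
enc_idx = rng.choice(len(data), 32, replace=False)
dec_idx = rng.choice(len(data), 32, replace=False)
loss = fre_loss(params, eta, data[enc_idx], data[dec_idx], rng)
```

Mean pooling is used here because it shares the property the caption emphasizes: reordering the (s, η(s)) pairs does not change z. A transformer encoder without positional encodings, followed by pooling, has the same property while also letting pairs attend to one another.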