RLZero: Direct Policy Inference from Language Without In-Domain Supervision

Harshit Sikchi, Siddhant Agarwal, Pranaya Jajoo, Samyak Parajuli, Caleb Chuck, Max Rudolph, Peter Stone, Amy Zhang, Scott Niekum

TL;DR

RLZero uses an RL agent pretrained only on unlabeled, offline interactions, with no task-specific supervision or labeled trajectories, to infer policies zero-shot at test time from arbitrary natural language instructions. To the authors' knowledge, it is the first approach to show direct language-to-behavior generation across a variety of tasks and environments without any in-domain supervision.

Abstract

The reward hypothesis states that all goals and purposes can be understood as the maximization of a received scalar reward signal. However, in practice, defining such a reward signal is notoriously difficult, as humans are often unable to predict the optimal behavior corresponding to a reward function. Natural language offers an intuitive alternative for instructing reinforcement learning (RL) agents, yet previous language-conditioned approaches either require costly supervision or test-time training given a language instruction. In this work, we present a new approach that uses a pretrained RL agent trained using only unlabeled, offline interactions--without task-specific supervision or labeled trajectories--to get zero-shot test-time policy inference from arbitrary natural language instructions. We introduce a framework comprising three steps: imagine, project, and imitate. First, the agent imagines a sequence of observations corresponding to the provided language description using video generative models. Next, these imagined observations are projected into the target environment domain. Finally, an agent pretrained in the target environment with unsupervised RL instantly imitates the projected observation sequence through a closed-form solution. To the best of our knowledge, our method, RLZero, is the first approach to show direct language-to-behavior generation abilities on a variety of tasks and environments without any in-domain supervision. We further show that components of RLZero can be used to generate policies zero-shot from cross-embodied videos, such as those available on YouTube, even for complex embodiments like humanoids.
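
The three steps named in the abstract (imagine, project, imitate) compose into a single inference call. The sketch below is only an illustration of that control flow, not the authors' implementation: `language_to_policy` and the three `toy_*` stand-ins are hypothetical names, and the toy stages operate on integers rather than video frames or learned models.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

Frame = TypeVar("Frame")    # one imagined frame (e.g. from a video model)
Obs = TypeVar("Obs")        # one observation in the target environment
Policy = TypeVar("Policy")  # whatever the imitation step returns


def language_to_policy(
    prompt: str,
    imagine: Callable[[str], Sequence[Frame]],
    project: Callable[[Frame], Obs],
    imitate: Callable[[List[Obs]], Policy],
) -> Policy:
    """Chain the three stages: text -> frames -> observations -> policy."""
    imagined_frames = imagine(prompt)                     # 1. imagine
    projected = [project(f) for f in imagined_frames]     # 2. project
    return imitate(projected)                             # 3. imitate


# --- Toy stand-ins (assumptions for illustration only) ---------------------
def toy_imagine(prompt: str) -> List[int]:
    # Pretend each word yields one "frame", encoded as the word length.
    return [len(word) for word in prompt.split()]


def toy_project(frame: int) -> int:
    # Pretend the environment only supports values up to 5.
    return min(frame, 5)


def toy_imitate(observations: List[int]) -> Tuple[int, ...]:
    # Pretend the "policy" is just the target observation sequence.
    return tuple(observations)


policy = language_to_policy("walk forward fast", toy_imagine, toy_project, toy_imitate)
```

Passing the stages in as callables keeps the sketch independent of any particular video model, embedding model, or pretrained agent; none of those components are specified here.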

Paper Structure

This paper contains 33 sections, 2 theorems, 16 equations, 9 figures, 9 tables, 1 algorithm.

Key Result

theorem 1

Define $J(\pi, r)$ to be the expected return of a policy $\pi$ under reward $r$. For an offline dataset $d^O$ with density $\rho$, a learned log distribution ratio: $\nu(s)=\log(\frac{\rho^E(s)}{\rho(s)})$, $D_{KL}(\rho^\pi, \rho^E) \le -J(\pi, r^{imit}) + D_{KL}(\rho^\pi(s,a), \rho(s,a))$ where $r

Figures (9)

  • Figure 1: RLZero framework of imagine, project, and imitate: A video trajectory is imagined from the text prompt, and each frame is projected to the agent's observation space. A closed-form solution to imitation learning for BFMs trained with unsupervised RL is used to obtain a policy that mimics the projected video behavior.
  • Figure 2: Grounding Imagination in Real Observations: We use nearest-image retrieval, defined by cosine similarity in the embedding space, to output a real observation from the dataset that matches the imagined frame.
  • Figure 3: Examples of cross-embodied imitation: RLZero can mimic motions demonstrated in YouTube or AI-generated videos zero-shot. Top 2 rows: Stickman (2D Humanoid). Bottom 2 rows: SMPL 3D Humanoid.
  • Figure 4: Failure Cases in RLZero
  • Figure 5: Illustrative diagram of imagination-free RLZero inference
  • ...and 4 more figures
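
The Figure 2 caption describes grounding each imagined frame by retrieving the dataset observation with the highest cosine similarity in an embedding space. A minimal sketch of that retrieval step follows, assuming the embeddings have already been computed by some image encoder (not specified here); the function names are hypothetical and the vectors are plain Python lists rather than model outputs.

```python
import math
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest_observation_index(
    imagined_embedding: Sequence[float],
    dataset_embeddings: Sequence[Sequence[float]],
) -> int:
    """Index of the dataset embedding most similar to the imagined frame."""
    if not dataset_embeddings:
        raise ValueError("dataset_embeddings must not be empty")
    return max(
        range(len(dataset_embeddings)),
        key=lambda i: cosine_similarity(imagined_embedding, dataset_embeddings[i]),
    )


def project_frames(
    imagined_embeddings: Sequence[Sequence[float]],
    dataset_embeddings: Sequence[Sequence[float]],
) -> List[int]:
    """Map every imagined frame to the index of its nearest real observation."""
    return [
        nearest_observation_index(e, dataset_embeddings)
        for e in imagined_embeddings
    ]


# Toy 2-D embeddings (made up for illustration).
dataset = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
imagined = [[0.9, 0.1], [0.1, 2.0], [-3.0, 0.2]]
matches = project_frames(imagined, dataset)
```

Because cosine similarity ignores vector magnitude, the second and third imagined vectors still match their nearest direction despite having larger norms. For a large dataset, a vectorized or approximate nearest-neighbor search would be the more practical choice.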

Theorems & Definitions (3)

  • theorem 1
  • theorem 1
  • proof