Offline Reinforcement Learning from Images with Latent Space Models

Rafael Rafailov, Tianhe Yu, Aravind Rajeswaran, Chelsea Finn

TL;DR

LOMPO tackles offline vision-based control by learning a variational latent dynamics model with an ensemble to quantify uncertainty in the latent space. It constructs an uncertainty-penalized latent MDP and optimizes policies via an uncertainty-aware ELBO, providing a pessimistic regularization that mitigates distribution shift in offline data. Across four simulated image-based tasks and a real-robot drawer task, LOMPO outperforms offline model-free methods and rivals online model-based approaches, demonstrating strong sample efficiency and robustness to dataset size. The work enables practical, safe visuomotor control from pre-collected data and lays groundwork for scalable, multi-task offline vision RL.

Abstract

Offline reinforcement learning (RL) refers to the problem of learning policies from a static dataset of environment interactions. Offline RL enables extensive use and re-use of historical datasets, while also alleviating safety concerns associated with online exploration, thereby expanding the real-world applicability of RL. Most prior work in offline RL has focused on tasks with compact state representations. However, the ability to learn directly from rich observation spaces like images is critical for real-world applications such as robotics. In this work, we build on recent advances in model-based algorithms for offline RL and extend them to high-dimensional visual observation spaces. Model-based offline RL algorithms have achieved state-of-the-art results in state-based tasks and have strong theoretical guarantees. However, they rely crucially on the ability to quantify uncertainty in the model predictions, which is particularly challenging with image observations. To overcome this challenge, we propose to learn a latent-state dynamics model and represent the uncertainty in the latent space. Our approach is both tractable in practice and corresponds to maximizing a lower bound of the ELBO in the unknown POMDP. In experiments on a range of challenging image-based locomotion and manipulation tasks, we find that our algorithm significantly outperforms previous offline model-free RL methods as well as state-of-the-art online visual model-based RL methods. Moreover, we find that our approach excels on an image-based drawer closing task on a real robot using a pre-existing dataset. All results, including videos, can be found online at https://sites.google.com/view/lompo/.

Paper Structure

This paper contains 22 sections, 9 equations, 5 figures, 3 tables, 1 algorithm.

Figures (5)

  • Figure 1: LOMPO learns vision-based policies from offline datasets, without any interaction in the environment.
  • Figure 2: Images are passed through a convolutional encoder $E_\theta$ to form a compact representation, which is then used along with the previous state to infer the current state $s_t$. The model is trained by reconstructing the images from the latent states through the decoder network $D_{\theta}$. Latent rollouts are carried out by choosing a learned transition model $\widehat{T}_{\theta_j}(s_{t+1}|s_t, a_t)$ at random, and rewards are penalized based on ensemble disagreement.
  • Figure 3: Test environments. DeepMind Control Walker task: the observations are raw $64\times64$ images. Robel D'Claw Screw and Adroit Pen tasks: the observations are raw $128\times128$ images and robot proprioception. Sawyer Door Open environment: the observation space is raw $128\times128$ images. Real robot environment: the observations are raw $64\times 64$ images from the overhead camera.
  • Figure 4: Agent performance based on dataset size
  • Figure 5: Samples from the learned variational model. First row: ground truth sequence; second row: posterior model samples; third row: ensemble latent model rollout conditioned on the action sequence.
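
The latent rollout described in the Figure 2 caption (pick a random ensemble member to step the latent state, and penalize the reward by ensemble disagreement) can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the paper's code: the toy linear "transition models" stand in for the learned $\widehat{T}_{\theta_j}$, the disagreement measure (standard deviation of member predictions, averaged over latent dimensions) is an assumed proxy for the paper's uncertainty estimate, and `penalty_weight` is an assumed name for the penalty coefficient. Sampling a new member at every step is also an assumption.

```python
import random
import statistics


def make_toy_ensemble(num_models, seed=0):
    """Hypothetical stand-in for K learned latent transition models.

    Each member is a deterministic linear map with slightly different
    coefficients, so the members disagree a little on every prediction.
    """
    rng = random.Random(seed)
    members = []
    for _ in range(num_models):
        a = 0.9 + rng.uniform(-0.05, 0.05)
        b = 0.5 + rng.uniform(-0.05, 0.05)
        # Bind a and b as defaults so each lambda keeps its own coefficients.
        members.append(lambda s, act, a=a, b=b: [a * x + b * act for x in s])
    return members


def disagreement(ensemble, state, action):
    """Assumed uncertainty proxy: per-dimension population std of the
    members' next-state predictions, averaged over latent dimensions."""
    preds = [model(state, action) for model in ensemble]
    per_dim = [statistics.pstdev(values) for values in zip(*preds)]
    return sum(per_dim) / len(per_dim)


def penalized_rollout(ensemble, reward_fn, policy, state, horizon,
                      penalty_weight, seed=0):
    """Roll out in latent space and return (state, action, reward, u) tuples,
    where reward is already penalized by penalty_weight * u."""
    rng = random.Random(seed)
    trajectory = []
    for _ in range(horizon):
        action = policy(state)
        u = disagreement(ensemble, state, action)
        reward = reward_fn(state, action) - penalty_weight * u
        trajectory.append((state, action, reward, u))
        # Step with one randomly chosen ensemble member (assumed per step).
        state = rng.choice(ensemble)(state, action)
    return trajectory


if __name__ == "__main__":
    ensemble = make_toy_ensemble(num_models=5)
    rollout = penalized_rollout(
        ensemble,
        reward_fn=lambda s, a: -sum(x * x for x in s),  # toy reward
        policy=lambda s: -0.1 * s[0],                   # toy policy
        state=[1.0, -0.5],
        horizon=3,
        penalty_weight=1.0,
    )
    for step in rollout:
        print(step)
```

With `penalty_weight=0.0` the rollout reduces to an unpenalized model rollout, which makes it easy to see how much reward the pessimism term removes in regions where the ensemble members disagree.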