Masked World Models for Visual Control

Younggyo Seo, Danijar Hafner, Hao Liu, Fangchen Liu, Stephen James, Kimin Lee, Pieter Abbeel

TL;DR

MWM tackles sample-efficient visual model-based reinforcement learning by decoupling visual representation learning from dynamics learning. It trains a masked autoencoder over convolutional features with an auxiliary reward-prediction objective, then trains a latent dynamics model on the learned representations, with both components updated online. Empirically, it achieves state-of-the-art results on challenging visual robotic tasks from Meta-world and RLBench, outperforming DreamerV2, and shows that convolutional feature masking can outperform patch-based MAE. The work also gives qualitative evidence that reward-guided representations and task-focused latent dynamics improve prediction of task-relevant objects, and it points to possible extensions to multi-modal and temporally richer settings.

Abstract

Visual model-based reinforcement learning (RL) has the potential to enable sample-efficient robot learning from visual observations. Yet the current approaches typically train a single model end-to-end for learning both visual representations and dynamics, making it difficult to accurately model the interaction between robots and small objects. In this work, we introduce a visual model-based RL framework that decouples visual representation learning and dynamics learning. Specifically, we train an autoencoder with convolutional layers and vision transformers (ViT) to reconstruct pixels given masked convolutional features, and learn a latent dynamics model that operates on the representations from the autoencoder. Moreover, to encode task-relevant information, we introduce an auxiliary reward prediction objective for the autoencoder. We continually update both autoencoder and dynamics model using online samples collected from environment interaction. We demonstrate that our decoupling approach achieves state-of-the-art performance on a variety of visual robotic tasks from Meta-world and RLBench, e.g., we achieve 81.7% success rate on 50 visual robotic manipulation tasks from Meta-world, while the baseline achieves 67.9%. Code is available on the project website: https://sites.google.com/view/mwm-rl.
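The masking step described in the abstract can be illustrated with a minimal sketch. It assumes the convolutional feature map has already been flattened into a list of per-location tokens; the function names, the 0.75 mask ratio, and the reward weight are illustrative assumptions rather than values stated on this page, and the actual encoder, ViT, and decoder are omitted.

```python
import random

def mask_feature_tokens(tokens, mask_ratio=0.75, seed=0):
    """Randomly hide a fraction of feature tokens (hypothetical helper).

    tokens: list of per-location feature vectors, e.g. a flattened
            convolutional feature map.
    Returns (visible_tokens, visible_indices, masked_indices).
    """
    rng = random.Random(seed)
    n = len(tokens)
    n_masked = int(n * mask_ratio)      # assumed ratio, not from the source
    order = list(range(n))
    rng.shuffle(order)
    masked = sorted(order[:n_masked])   # positions the decoder must fill in
    visible = sorted(order[n_masked:])  # positions the encoder actually sees
    return [tokens[i] for i in visible], visible, masked

def autoencoder_loss(pixel_error, reward_error, reward_weight=1.0):
    """Pixel reconstruction loss plus the auxiliary reward-prediction loss."""
    return pixel_error + reward_weight * reward_error

# Example: a 4x4 feature map flattened to 16 one-dimensional tokens.
tokens = [[float(i)] for i in range(16)]
visible_tokens, visible_idx, masked_idx = mask_feature_tokens(tokens)
```

Only the visible tokens would be passed to the transformer encoder; the decoder reconstructs pixels for the full image, and the reward term pushes the representation toward task-relevant content.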

Paper Structure (67 sections, 13 equations, 16 figures, 1 table, 1 algorithm)

Figures (16)

  • Figure 1: Illustration of our approach. We continually update visual representations and dynamics using online samples collected from environment interaction, by repeating iterative processes of training (Left) an autoencoder with convolutional feature masking and reward prediction and (Right) a latent dynamics model in the latent space of the autoencoder. We note that autoencoder parameters are not updated during dynamics learning.
  • Figure 2: Examples of visual observations used in our experiments. We consider a variety of visual robot control tasks from Meta-world [yu2020meta], RLBench [james2020rlbench], and DeepMind Control Suite [tassa2020dm_control].
  • Figure 3: Learning curves on six visual robotic manipulation tasks from Meta-world, measured by success rate. We select tasks that require modeling interactions between small objects and robot arms. Learning curves on all 50 tasks are available in the appendix. The solid line and shaded regions represent the mean and bootstrap confidence intervals, respectively, across five runs.
  • Figure 4: (a) Aggregate performance on all 50 Meta-world tasks. We normalize environment steps by the maximum steps in each task. The solid line and shaded regions represent the mean and stratified bootstrap confidence intervals, respectively, across 250 runs. We report the learning curves on (b) Reach Target and (c) Push Button from RLBench. Performance is not directly comparable to previous results [james2022q, james2021coarse] because of differences in setup (see the RLBench experiments subsection). The solid line and shaded regions represent the mean and bootstrap confidence intervals, respectively, across eight runs.
  • Figure 5: Learning curves on three visual robot control tasks from DeepMind Control Suite, measured by episode return. The solid line and shaded regions represent the mean and bootstrap confidence intervals, respectively, across eight runs.
  • ...and 11 more figures
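
The captions above report a mean with bootstrap confidence intervals across a small number of runs. The following is a minimal sketch of a plain percentile bootstrap for one evaluation point; it is not the stratified variant used for the aggregate Meta-world plot, and the function name, resample count, and example success rates are hypothetical.

```python
import random
import statistics

def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean (illustrative).

    values: one score per run, e.g. success rates at a fixed training step.
    Returns (mean, lower, upper) for a (1 - alpha) interval.
    """
    rng = random.Random(seed)
    n = len(values)
    # Resample runs with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return statistics.fmean(values), lower, upper

# Made-up success rates from five runs, for illustration only.
runs = [0.6, 0.8, 0.7, 0.9, 0.75]
mean, lower, upper = bootstrap_ci(runs)
```

Repeating this at every evaluation step gives the solid line (mean) and shaded band (lower to upper) of a learning curve.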