MoDem: Accelerating Visual Model-Based Reinforcement Learning with Demonstrations

Nicklas Hansen, Yixin Lin, Hao Su, Xiaolong Wang, Vikash Kumar, Aravind Rajeswaran

TL;DR

MoDem tackles the sample-efficiency bottleneck in visual model-based RL by introducing a three-phase framework that leverages a small set of demonstrations. It combines policy pretraining, seeding with exploration, and aggressive demonstration oversampling (along with data augmentation) to stabilize and accelerate learning, using TD-MPC as the backbone. Across 21 challenging visuo-motor tasks with sparse rewards and a 100K-step budget, MoDem achieves substantial gains over strong baselines, highlighting the critical roles of each phase and the benefits of end-to-end representation learning. The approach offers practical impact for robotics and embodied AI by enabling effective, data-efficient learning from limited expert demonstrations.

Abstract

Poor sample efficiency continues to be the primary challenge for deployment of deep Reinforcement Learning (RL) algorithms for real-world applications, and in particular for visuo-motor control. Model-based RL has the potential to be highly sample efficient by concurrently learning a world model and using synthetic rollouts for planning and policy improvement. However, in practice, sample-efficient learning with model-based RL is bottlenecked by the exploration challenge. In this work, we find that leveraging just a handful of demonstrations can dramatically improve the sample-efficiency of model-based RL. Simply appending demonstrations to the interaction dataset, however, does not suffice. We identify key ingredients for leveraging demonstrations in model learning -- policy pretraining, targeted exploration, and oversampling of demonstration data -- which form the three phases of our model-based RL framework. We empirically study three complex visuo-motor control domains and find that our method is 150%-250% more successful in completing sparse reward tasks compared to prior approaches in the low data regime (100K interaction steps, 5 demonstrations). Code and videos are available at: https://nicklashansen.github.io/modemrl

Paper Structure

This paper contains 25 sections, 8 equations, 14 figures, 4 tables, 1 algorithm.

Figures (14)

  • Figure 1: Success rate (%) in sparse reward tasks. Given only 5 human demonstrations and a limited online interaction budget, our method significantly improves the success rate in 21 challenging visuo-motor control tasks.
  • Figure 2: Our framework (MoDem) consists of three phases: (1) a policy pretraining phase where the representation and policy are trained on a handful of demonstrations via behavior cloning (BC), (2) a seeding phase where the pretrained policy is used to generate rollouts for targeted model learning, and (3) an interactive learning phase where the model iteratively collects new rollouts and is trained with data from all three phases. Crucially, we aggressively oversample demonstration data for model learning, regularize the model using data augmentation, and reuse weights across phases. $\operatorname{sg}$: stop-gradient operator.
  • Figure 3: Tasks. We evaluate methods on a total of $\mathbf{21}$ challenging image-based tasks spanning three domains: Adroit [Rajeswaran-RSS-18], Meta-World [yu2019meta], and DMControl [deepmindcontrolsuite2018]. Observations are raw ($224\times224$) RGB frames (pictured). Environments contain rich visual features such as textures and shading, and require particularly fine-grained control due to complex geometry. See the appendices on the experimental setup and task visualizations for additional visualizations and a full list of tasks.
  • Figure 4: Main result. Success rate and episode return as a function of interaction steps for each of the three domains that we consider (Adroit, Meta-World, DMControl), aggregated across a total of $\mathbf{21}$ challenging, visual robotics tasks. Adroit and Meta-World use sparse rewards. Mean of 5 seeds; shaded area indicates 95% CIs. Our method is significantly more sample-efficient than prior methods.
  • Figure 5: Meta-World. Success rate for our method and baselines on 15 difficult, sparse-reward Meta-World tasks with image inputs. Mean of 5 seeds; shaded area indicates $95\%$ CIs.
  • ...and 9 more figures
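
The Figure 2 caption says that demonstration data is aggressively oversampled during model learning. As a rough illustration of what that can mean in practice, the sketch below draws each training batch from two separate buffers with a fixed demonstration share. The function name `sample_mixed_batch`, the list-based buffers, and the default `demo_fraction=0.5` are all assumptions made for this example; the text above does not state the ratio or the sampling scheme the authors use.

```python
import random


def sample_mixed_batch(demo_buffer, online_buffer, batch_size,
                       demo_fraction=0.5, rng=random):
    """Draw a batch in which a fixed share of samples are demonstrations.

    Hypothetical helper: demo_fraction=0.5 is an illustrative default,
    not a value taken from the paper.
    """
    if not demo_buffer:
        raise ValueError("demo_buffer must not be empty")
    if not 0.0 <= demo_fraction <= 1.0:
        raise ValueError("demo_fraction must lie in [0, 1]")

    # Before any online data exists, the whole batch comes from demonstrations.
    if online_buffer:
        n_demo = round(batch_size * demo_fraction)
    else:
        n_demo = batch_size
    n_online = batch_size - n_demo

    # Sampling with replacement lets a handful of demonstration transitions
    # fill a large share of every batch, however large the online buffer grows.
    demo_part = rng.choices(demo_buffer, k=n_demo)
    online_part = rng.choices(online_buffer, k=n_online) if n_online else []

    batch = demo_part + online_part
    rng.shuffle(batch)
    return batch
```

With uniform sampling from a single merged buffer, five demonstrations would make up a vanishing share of each batch as interaction data accumulates. Fixing the share per batch keeps their influence constant throughout training, which is one plausible reading of "oversampling" here.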