Accelerating Model-Based Reinforcement Learning with State-Space World Models

Maria Krinner, Elie Aljalbout, Angel Romero, Davide Scaramuzza

TL;DR

Model-based RL often yields better sample efficiency but suffers from slow training due to sequential world-model updates. The authors introduce S5WM, a state-space world model that replaces recurrent RSSMs with the parallelizable S5 architecture and leverages privileged state information during training, achieving substantial speedups without sacrificing performance. Across state-based and vision-based quadrotor tasks, S5WM matches or exceeds DreamerV3 in task rewards and sample efficiency while delivering up to $4\times$ faster overall training and up to $10\times$ faster world-model training. The work demonstrates strong sim-to-real transfer on agile drone tasks and provides a roadmap for faster, more practical MBRL in real-world robotics.

Abstract

Reinforcement learning (RL) is a powerful approach for robot learning. However, model-free RL (MFRL) requires a large number of environment interactions to learn successful control policies. This is due to the noisy RL training updates and the complexity of robotic systems, which typically involve highly non-linear dynamics and noisy sensor signals. In contrast, model-based RL (MBRL) not only trains a policy but simultaneously learns a world model that captures the environment's dynamics and rewards. The world model can either be used for planning, for data collection, or to provide first-order policy gradients for training. Leveraging a world model significantly improves sample efficiency compared to model-free RL. However, training a world model alongside the policy increases the computational complexity, leading to longer training times that are often intractable for complex real-world scenarios. In this work, we propose a new method for accelerating model-based RL using state-space world models. Our approach leverages state-space models (SSMs) to parallelize the training of the dynamics model, which is typically the main computational bottleneck. Additionally, we propose an architecture that provides privileged information to the world model during training, which is particularly relevant for partially observable environments. We evaluate our method in several real-world agile quadrotor flight tasks, involving complex dynamics, for both fully and partially observable environments. We demonstrate a significant speedup, reducing the world model training time by up to 10 times, and the overall MBRL training time by up to 4 times. This benefit comes without compromising performance, as our method achieves similar sample efficiency and task rewards to state-of-the-art MBRL methods.
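The abstract's claim that SSMs let the dynamics model train in parallel rests on the fact that a linear recurrence can be evaluated with an associative scan. The following is a minimal sketch of that idea in plain Python, not the authors' implementation: it assumes a scalar (or elementwise diagonal) recurrence $h_t = a_t h_{t-1} + b_t$, and the function names `combine`, `sequential_scan`, and `associative_scan` are illustrative choices rather than names from the paper.

```python
def combine(left, right):
    """Compose two affine maps h -> a*h + b; `left` is applied first."""
    a1, b1 = left
    a2, b2 = right
    return (a2 * a1, a2 * b1 + b2)


def sequential_scan(a, b, h0=0.0):
    """Reference recurrence h_t = a_t * h_{t-1} + b_t, one step at a time."""
    h, out = h0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return out


def associative_scan(a, b, h0=0.0):
    """Same recurrence via a Hillis-Steele inclusive scan.

    Each of the ceil(log2(T)) rounds only reads the previous round's values,
    so all T updates inside one round could run in parallel on suitable
    hardware. Here they run in a list comprehension for clarity.
    """
    elems = list(zip(a, b))
    step = 1
    while step < len(elems):
        prev = elems
        elems = [
            combine(prev[i - step], prev[i]) if i >= step else prev[i]
            for i in range(len(prev))
        ]
        step *= 2
    # elems[t] now holds the composed affine map taking h0 to h_t.
    return [a_t * h0 + b_t for a_t, b_t in elems]
```

Both functions return the same hidden states up to floating-point rounding. The scan version does more total arithmetic but needs only a logarithmic number of dependent rounds, which is where a speedup over a step-by-step recurrent update can come from when the rounds are executed in parallel.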

Paper Structure

This paper contains 35 sections, 21 equations, 15 figures, 5 tables.

Figures (15)

  • Figure 1: State-of-the-art model-based RL (MBRL) methods typically employ recurrent state-space models (RSSMs) as the world model backbone, which are slow to train due to the sequential nature of RNNs. We leverage state-space models (SSMs) to parallelize world-model training over the sequence dimension, thereby reducing its computational cost. Moreover, we propose reconstructing lower-dimensional privileged observations $s_t$, rather than high-dimensional image observations $o_t$.
  • Figure 2: To train the actor and critic, we leverage imagined rollouts in latent space generated by our state-space world model. We obtain these imaginations by encoding the initial observation and rolling out the sequence model in the latent space. Although the world model itself can be trained in parallel, the imagination step (needed to train the actor-critic) cannot be parallelized, because each rollout step depends on an action produced by the policy.
  • Figure 3: Task reward over the number of environment interactions for S5WM, DreamerV3 and PPO.
  • Figure 4: Each training step is divided into: i) training the world model (WM), ii) optimizing the policy (AC), and iii) collecting new data. We show the times for each stage, as well as the overall duration per step, averaged over $5\times10^5$ steps.
  • Figure 5: TODO: quality is really bad
  • ...and 10 more figures
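
The Figure 2 caption notes that imagination cannot be parallelized because every step needs an action from the policy. A minimal sketch of why, under toy assumptions: `toy_policy` and `toy_dynamics` are hypothetical scalar stand-ins for the actor and the latent sequence model, and `imagine` is an illustrative name, none of which come from the paper.

```python
def toy_policy(z):
    """Hypothetical stand-in for the actor: action as a function of the latent."""
    return -0.5 * z


def toy_dynamics(z, a):
    """Hypothetical stand-in for one latent step of the world model."""
    return 0.9 * z + a


def imagine(z0, policy, dynamics, horizon):
    """Roll out latent states from z0 for `horizon` steps.

    The action at step t is computed from the latent at step t, which only
    exists once step t-1 has finished, so the loop must run in order.
    """
    z, trajectory = z0, []
    for _ in range(horizon):
        a = policy(z)        # depends on the current latent
        z = dynamics(z, a)   # depends on the action just chosen
        trajectory.append((z, a))
    return trajectory


rollout = imagine(1.0, toy_policy, toy_dynamics, horizon=3)
```

During world-model training the actions are already recorded in the replay data, so the inputs to every time step are known up front. In imagination they are not, which is the data dependence the caption describes.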