Recurrent World Models Facilitate Policy Evolution

David Ha, Jürgen Schmidhuber

TL;DR

This work introduces a three-component recurrent world-model (ConvVAE V, MDN-RNN M, and a small linear controller C) trained with CMA-ES to solve pixel-based RL tasks. It shows state-of-the-art performance on CarRacing-v0 and demonstrates training inside a latent, generated Doom environment with successful transfer to the real Doom task, aided by a tunable uncertainty parameter τ to prevent exploiting model inaccuracies. A key insight is that stochastic dynamics in the latent world can support robust policy learning and transfer, offering a path toward efficient sim-to-real and latent-space reinforcement learning. The study highlights both the potential and the limitations of learning and exploiting internal generative models for policy evolution.
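
The role of the uncertainty parameter τ can be illustrated with a minimal sketch of temperature-scaled sampling from a one-dimensional Gaussian mixture, the kind of output an MDN-RNN produces. The function names, the toy mixture values, and the specific scaling rule (mixture logits divided by τ, standard deviation multiplied by √τ) are assumptions made for illustration, not details stated in this summary.

```python
import math
import random


def tempered_probs(log_weights, tau):
    """Softmax over mixture logits after dividing them by the temperature tau."""
    scaled = [lw / tau for lw in log_weights]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_mdn(log_weights, means, log_stds, tau=1.0, rng=random):
    """Draw one sample from a 1-D Gaussian mixture at temperature tau.

    Assumed scaling rule: logits are divided by tau and each component's
    standard deviation is multiplied by sqrt(tau).
    """
    probs = tempered_probs(log_weights, tau)
    k = rng.choices(range(len(probs)), weights=probs)[0]  # pick a component
    std = math.exp(log_stds[k]) * math.sqrt(tau)
    return means[k] + std * rng.gauss(0.0, 1.0)


if __name__ == "__main__":
    rng = random.Random(0)
    mixture = ([0.0, -2.0], [-1.0, 1.0], [-1.0, -1.0])  # toy values
    for tau in (0.1, 1.0, 1.5):
        draws = [sample_mdn(*mixture, tau=tau, rng=rng) for _ in range(5)]
        print(tau, [round(d, 2) for d in draws])
```

Under this rule, a low τ concentrates samples on the most likely mixture component with little noise, while a higher τ makes the generated environment more random, which is the lever the TL;DR describes for discouraging policies that exploit model inaccuracies.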

Abstract

A generative recurrent neural network is quickly trained in an unsupervised manner to model popular reinforcement learning environments through compressed spatio-temporal representations. The world model's extracted features are fed into compact and simple policies trained by evolution, achieving state of the art results in various environments. We also train our agent entirely inside of an environment generated by its own internal world model, and transfer this policy back into the actual environment. Interactive version of paper at https://worldmodels.github.io

Paper Structure

This paper contains 16 sections, 6 figures, 4 tables, 1 algorithm.

Figures (6)

  • Figure 1: We build probabilistic generative models of OpenAI Gym [openai_gym] environments. These models can mimic the actual environments (left). We test trained policies in the actual environments (right).
  • Figure 2: Flow diagram showing how V, M, and C interact with the environment (left). Pseudocode for how our agent model is used in the OpenAI Gym [openai_gym] environment (right).
  • Figure 3: Description of tensor shapes for each layer of our ConvVAE (left). MDN-RNN similar to the one used in [graves_rnn, sketchrnn, carter2016experiments] (right).
  • Figure 4: Training progress of CarRacing-v0 (left). Histogram of cumulative rewards. Score is 906 $\pm$ 21 (right).
  • Figure 5: When agent sees only $z_t$ but not $h_t$, score is 632 $\pm$ 251 (left). If we add a hidden layer on top of only $z_t$, score increases to 788 $\pm$ 141 (right).
  • ...and 1 more figure
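
The interaction of V, M, and C with the environment described in the Figure 2 caption can be sketched as a single rollout loop. This is a minimal, runnable illustration and not the authors' code: `ToyEnv`, `encode`, `rnn_step`, and `act` are hypothetical stand-ins, the dimensions are toy values, and the `tanh` squashing on the controller output is an assumption added here.

```python
import math

Z_DIM, H_DIM, A_DIM = 4, 8, 2  # toy sizes, chosen only for illustration
N_PARAMS = A_DIM * (Z_DIM + H_DIM) + A_DIM  # weights plus biases of C


class ToyEnv:
    """Hypothetical stand-in for a Gym-style environment with a short episode."""

    def __init__(self, horizon=5):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return [0.0] * 16  # fake flattened "frame"

    def step(self, action):
        self.t += 1
        obs = [float(self.t)] * 16
        reward = -sum(a * a for a in action)  # arbitrary toy reward
        return obs, reward, self.t >= self.horizon


def encode(obs):
    """Stand-in for V: compress an observation into a latent vector z."""
    chunk = len(obs) // Z_DIM
    return [math.tanh(sum(obs[i * chunk:(i + 1) * chunk]) / chunk)
            for i in range(Z_DIM)]


def rnn_step(a, z, h):
    """Stand-in for M: update the hidden state h from (a, z, h)."""
    drive = sum(a) + sum(z)
    return [math.tanh(0.5 * h_i + 0.1 * drive) for h_i in h]


def act(params, z, h):
    """C: one linear layer on the concatenation [z, h], squashed with tanh."""
    x = z + h
    n = len(x)
    actions = []
    for j in range(A_DIM):
        w = params[j * n:(j + 1) * n]
        b = params[A_DIM * n + j]
        actions.append(math.tanh(sum(wi * xi for wi, xi in zip(w, x)) + b))
    return actions


def rollout(params, env):
    """Run one episode and return the cumulative reward for these parameters."""
    obs = env.reset()
    h = [0.0] * H_DIM
    done = False
    total = 0.0
    while not done:
        z = encode(obs)                   # V: frame -> latent z
        a = act(params, z, h)             # C: (z, h) -> action
        obs, reward, done = env.step(a)
        total += reward
        h = rnn_step(a, z, h)             # M: advance the memory
    return total


if __name__ == "__main__":
    print(rollout([0.1] * N_PARAMS, ToyEnv()))
```

Because the controller is just a flat parameter vector scored by the cumulative reward of `rollout`, it fits naturally into a black-box optimizer such as CMA-ES, which proposes parameter vectors and ranks them by this return.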