Shaping Belief States with Generative Environment Models for RL

Karol Gregor, Danilo Jimenez Rezende, Frederic Besse, Yan Wu, Hamza Merzic, Aaron van den Oord

TL;DR

The paper tackles how reinforcement learning agents can form stable, long-term beliefs about partially observable 3D environments by training expressive generative environment models conditioned on the agent's belief-state. It introduces an architecture that couples a belief-state RNN with a forward-simulating core and a frame-generative model (ConvDRAW with GECO), and demonstrates that overshooting future predictions is crucial for maintaining coherent maps and localization. Across Random City, DeepMind Lab, and voxel environments, the approach yields significant data-efficiency gains over strong model-free baselines, with memory architectures (Kanerva, slot-based) and joint training influencing representation quality and stability. The findings highlight both the potential of learned environment models for representation learning in RL and the practical challenges of conditioning expressive generators and scaling memory, pointing to future work in planning integration and larger-scale demonstrations.

Abstract

When agents interact with a complex environment, they must form and maintain beliefs about the relevant aspects of that environment. We propose a way to efficiently train expressive generative models in complex environments. We show that a predictive algorithm with an expressive generative model can form stable belief-states in visually rich and dynamic 3D environments. More precisely, we show that the learned representation captures the layout of the environment as well as the position and orientation of the agent. Our experiments show that the model substantially improves data-efficiency on a number of reinforcement learning (RL) tasks compared with strong model-free baseline agents. We find that predicting multiple steps into the future (overshooting), in combination with an expressive generative model, is critical for stable representations to emerge. In practice, using expressive generative models in RL is computationally expensive and we propose a scheme to reduce this computational burden, allowing us to build agents that are competitive with model-free baselines.

Paper Structure

This paper contains 21 sections, 1 equation, 15 figures, 4 tables.

Figures (15)

  • Figure 1: Diagram of the agent and model. The agent receives observations $x$ from the environment, processes them through a feed-forward residual network (green), and forms a state online using a recurrent network (blue). This state is a belief state; it is used to compute the policy and value and serves as the starting point for predictions of the future. These predictions are made by a second recurrent network (orange), a simulation network (SimCore) that simulates into the future seeing only the actions. The simulated state is used as conditioning for a generative model (red) of a future frame.
  • Figure 2: Random City environment. Rows: 1. Input sequence to the model, starting from the beginning of the episode. 2. Top-down view (a map). 3. Top-down view decoded from the belief state. The belief state was not trained with this decoding signal, but only from the first-person view (top row). The model fills in the map as it sees new frames. 4. Frames later in the sequence (after 170 steps). 5. Rollout from the model. The model knows what it will see as the agent rotates. See the supplementary video.
  • Figure 3: The choice of model and the overshoot length have a significant impact on the state representation. (a) All models benefit from an increase in the overshoot length with respect to position decoding, with the Contrastive model reaching higher accuracy; (b) The Generative models are the most sensitive to overshoot length with respect to map decoding MSE. A substantial reduction in map decoding MSE is obtained by using architectures with memory; (c) Examples of decoded maps. Each block shows real maps (top row) and decoded maps (bottom row). Top block: Contrastive model samples at Overshoot Length $1$ (MSE of approx. 160); Bottom block: Generative + Kanerva at Overshoot Length $12$ (MSE of approx. 117). The difference in detail between the two models is clearly visible.
  • Figure 4: Effect of overshoot on decoding the environment's map. This analysis shows that Generative and Generative + Kanerva benefit the most from an increase in overshoot length, in contrast to the Deterministic and Contrastive architectures. In particular, we observe that the Generative + Kanerva architecture is particularly good at forming belief-states that contain a map of the environment.
  • Figure 5: Generative SimCore results in substantial data-efficiency gains for agents in DeepMind-Lab relative to a strong model-free baseline. We also observe that model-free agents have substantially higher variance in their scores. See the supplementary video.
  • ...and 10 more figures
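
The data flow described in the Figure 1 caption can be illustrated with a small NumPy sketch. This is not the authors' implementation: the plain tanh RNN cells, the linear decoder standing in for the generative model, the mean-squared-error loss, all dimensions, and every function name are assumptions made for illustration. The sketch also assumes that actions[i] is the action taken between observation i and observation i + 1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only to keep the example small.
OBS_DIM, ACT_DIM, STATE_DIM = 8, 3, 16


def init_rnn(in_dim, state_dim):
    """Weights for a plain tanh RNN cell (a stand-in for the recurrent networks)."""
    return {
        "W_in": rng.normal(0.0, 0.1, (in_dim, state_dim)),
        "W_h": rng.normal(0.0, 0.1, (state_dim, state_dim)),
        "b": np.zeros(state_dim),
    }


def rnn_step(params, state, inp):
    """One recurrent update: new_state = tanh(inp W_in + state W_h + b)."""
    return np.tanh(inp @ params["W_in"] + state @ params["W_h"] + params["b"])


belief_rnn = init_rnn(OBS_DIM, STATE_DIM)  # consumes observations, online
sim_core = init_rnn(ACT_DIM, STATE_DIM)    # consumes actions only
# Assumption: a linear decoder replaces the expressive generative model.
W_dec = rng.normal(0.0, 0.1, (STATE_DIM, OBS_DIM))


def belief_states(observations):
    """Run the belief RNN over the observation sequence, one state per step."""
    state = np.zeros(STATE_DIM)
    states = []
    for x in observations:
        state = rnn_step(belief_rnn, state, x)
        states.append(state)
    return np.stack(states)


def overshoot_loss(observations, actions, t, overshoot_length):
    """Roll the simulation core forward from the belief state at step t.

    The simulated state sees only actions; each simulated state is decoded
    and compared with the real observation at the corresponding future step.
    """
    sim_state = belief_states(observations[: t + 1])[-1]
    errors = []
    for k in range(1, overshoot_length + 1):
        sim_state = rnn_step(sim_core, sim_state, actions[t + k - 1])
        prediction = sim_state @ W_dec
        errors.append(np.mean((prediction - observations[t + k]) ** 2))
    return float(np.mean(errors))


# Toy data: random observations and random one-hot actions.
T = 20
observations = rng.normal(size=(T, OBS_DIM))
actions = np.eye(ACT_DIM)[rng.integers(0, ACT_DIM, size=T)]
loss = overshoot_loss(observations, actions, t=5, overshoot_length=4)
```

In this sketch, overshoot_length plays the role of the overshoot length discussed in the captions for Figures 3 and 4: it sets how many future steps are predicted from a single belief state without seeing new observations.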