Accurate and Efficient World Modeling with Masked Latent Transformers

Maxime Burchi; Radu Timofte

TL;DR

The paper addresses the challenge of achieving both accuracy and efficiency in world modeling for model-based RL in complex environments. It introduces EMERALD, a masked latent Transformer world model that uses a spatial latent state and a MaskGIT predictor to generate accurate trajectories in latent space, enabling imagination-driven actor-critic learning. Empirical results on Crafter show state-of-the-art performance, surpassing human experts within 10M environment steps and unlocking all 22 achievements, while also demonstrating improved training efficiency over pixel-based approaches. The work highlights the advantages of combining spatial latents, Transformer memory, and masked latent decoding to preserve important perceptual details and long-term memory, with potential applicability to broader domains beyond Crafter.

Abstract

The Dreamer algorithm has recently obtained remarkable performance across diverse environment domains by training powerful agents with simulated trajectories. However, the compressed nature of its world model's latent space can result in the loss of crucial information, negatively affecting the agent's performance. Recent approaches, such as $\Delta$-IRIS and DIAMOND, address this limitation by training more accurate world models. However, these methods require training agents directly from pixels, which reduces training efficiency and prevents the agent from benefiting from the inner representations learned by the world model. In this work, we propose an alternative approach to world modeling that is both accurate and efficient. We introduce EMERALD (Efficient MaskEd latent tRAnsformer worLD model), a world model that uses a spatial latent state with MaskGIT predictions to generate accurate trajectories in latent space and improve agent performance. On the Crafter benchmark, EMERALD achieves new state-of-the-art performance, becoming the first method to surpass human expert performance within 10M environment steps. Our method also succeeds in unlocking all 22 Crafter achievements at least once during evaluation.
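To make the phrase "MaskGIT predictions" more concrete, the following is a minimal sketch of generic MaskGIT-style iterative decoding over a set of discrete latent tokens. It is not EMERALD's implementation: the function names, the `MASK` sentinel, the cosine unmasking schedule, and the random `toy_predictor` (a stand-in for the Transformer predictor) are all assumptions made for illustration. The idea it shows is that all tokens start masked, every masked position is predicted in parallel at each step, and only the most confident predictions are committed while the rest are re-masked for later steps.

```python
import math
import random

MASK = None  # hypothetical sentinel for a token that has not been decoded yet


def toy_predictor(tokens, num_classes, rng):
    """Hypothetical stand-in for the world model's predictor.

    Returns random logits of shape [num_tokens][num_classes]; a real model
    would condition on the already-decoded tokens, past states, and actions.
    """
    return [[rng.gauss(0.0, 1.0) for _ in range(num_classes)] for _ in tokens]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def maskgit_decode(num_tokens, num_classes, steps=4, seed=0,
                   predictor=toy_predictor):
    """Decode all tokens in a fixed number of parallel refinement steps."""
    rng = random.Random(seed)
    tokens = [MASK] * num_tokens
    for step in range(1, steps + 1):
        masked = [i for i, t in enumerate(tokens) if t is MASK]
        if not masked:
            break
        logits = predictor(tokens, num_classes, rng)
        # Sample a candidate token and record its probability (confidence)
        # for every position that is still masked.
        candidates = {}
        for i in masked:
            probs = softmax(logits[i])
            tok = rng.choices(range(num_classes), weights=probs)[0]
            candidates[i] = (tok, probs[tok])
        # Cosine schedule (assumed): how many tokens stay masked after this
        # step. It reaches zero at the last step; at least one token is
        # committed per step so decoding always makes progress.
        keep = math.floor(num_tokens * math.cos(math.pi / 2 * step / steps))
        keep = max(0, min(keep, len(masked) - 1))
        # Commit the most confident candidates; the rest remain masked.
        by_confidence = sorted(masked, key=lambda i: candidates[i][1],
                               reverse=True)
        for i in by_confidence[:len(masked) - keep]:
            tokens[i] = candidates[i][0]
    return tokens


if __name__ == "__main__":
    print(maskgit_decode(num_tokens=16, num_classes=8, steps=4))
```

In this sketch the number of decoding steps is much smaller than the number of tokens, which is the property that makes parallel masked decoding cheaper than predicting tokens one at a time.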
