Efficient World Models with Context-Aware Tokenization
Vincent Micheli, Eloi Alonso, François Fleuret
TL;DR
This work tackles the challenge of scaling model-based RL in visually rich environments by introducing $\Delta$-IRIS, a world-model agent that encodes stochastic deltas between time steps with discrete $\Delta$-tokens while conditioning on past frames and actions. An autoregressive transformer, augmented with continuous I-tokens, predicts future deltas and rewards, enabling compact representations and faster imagination. The approach achieves state-of-the-art Crafter results across multiple frame budgets and demonstrates substantial training speedups and evidence of disentangled deterministic and stochastic dynamics. These results show a viable path toward scalable, token-efficient world models in complex domains and point to future work on dynamic token budgeting and leveraging world-model representations for policy improvement.
Abstract
Scaling up deep Reinforcement Learning (RL) methods presents a significant challenge. Following developments in generative modelling, model-based RL positions itself as a strong contender. Recent advances in sequence modelling have led to effective transformer-based world models, albeit at the price of heavy computations due to the long sequences of tokens required to accurately simulate environments. In this work, we propose $\Delta$-IRIS, a new agent with a world model architecture composed of a discrete autoencoder that encodes stochastic deltas between time steps and an autoregressive transformer that predicts future deltas by summarizing the current state of the world with continuous tokens. In the Crafter benchmark, $\Delta$-IRIS sets a new state of the art at multiple frame budgets, while being an order of magnitude faster to train than previous attention-based approaches. We release our code and models at https://github.com/vmicheli/delta-iris.
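
To make the cost argument concrete, the following minimal sketch compares sequence lengths and causal-attention cost when each time step is encoded with many tokens versus a few delta tokens plus one continuous summary token. The function names and all numbers (20 time steps, 64 tokens per full frame, 4 delta tokens) are illustrative assumptions, not values taken from the paper or its code.

```python
def sequence_length(num_steps: int, obs_tokens: int, extra_tokens: int) -> int:
    """Total tokens when every time step contributes `obs_tokens` observation
    tokens plus `extra_tokens` others (e.g. an action or summary token)."""
    return num_steps * (obs_tokens + extra_tokens)


def causal_attention_pairs(seq_len: int) -> int:
    """Number of (query, key) pairs a causal self-attention layer evaluates."""
    return seq_len * (seq_len + 1) // 2


# Hypothetical settings, chosen only for illustration.
NUM_STEPS = 20

# Full-frame tokenization: 64 frame tokens + 1 action token per step.
full_len = sequence_length(NUM_STEPS, obs_tokens=64, extra_tokens=1)

# Delta tokenization: 4 delta tokens + 1 action token + 1 summary token per step.
delta_len = sequence_length(NUM_STEPS, obs_tokens=4, extra_tokens=2)

ratio = causal_attention_pairs(full_len) / causal_attention_pairs(delta_len)
print(f"full: {full_len} tokens, delta: {delta_len} tokens, "
      f"attention cost ratio: {ratio:.1f}x")
```

Under these assumed numbers the sequence shrinks by roughly a factor of ten, and because attention cost grows quadratically with sequence length, the number of attention pairs shrinks by roughly a factor of one hundred. The actual speedup of any real system also depends on the autoencoder and other components not modelled here.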
