Efficient World Models with Context-Aware Tokenization

Vincent Micheli, Eloi Alonso, François Fleuret

TL;DR

This work tackles the challenge of scaling model-based RL in visually rich environments by introducing $\Delta$-IRIS, a world-model agent that encodes stochastic deltas between time steps with discrete $\Delta$-tokens while conditioning on past frames and actions. An autoregressive transformer, augmented with continuous I-tokens, predicts future deltas and rewards, enabling compact representations and faster imagination. The approach achieves state-of-the-art Crafter results across multiple frame budgets, trains substantially faster than earlier attention-based agents, and shows evidence that deterministic and stochastic dynamics are disentangled. These results show a viable path toward scalable, token-efficient world models in complex domains and point to future work on dynamic token budgeting and leveraging world-model representations for policy improvement.

Abstract

Scaling up deep Reinforcement Learning (RL) methods presents a significant challenge. Following developments in generative modelling, model-based RL positions itself as a strong contender. Recent advances in sequence modelling have led to effective transformer-based world models, albeit at the price of heavy computations due to the long sequences of tokens required to accurately simulate environments. In this work, we propose $\Delta$-IRIS, a new agent with a world model architecture composed of a discrete autoencoder that encodes stochastic deltas between time steps and an autoregressive transformer that predicts future deltas by summarizing the current state of the world with continuous tokens. In the Crafter benchmark, $\Delta$-IRIS sets a new state of the art at multiple frame budgets, while being an order of magnitude faster to train than previous attention-based approaches. We release our code and models at https://github.com/vmicheli/delta-iris.
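
To make the interplay between the two components concrete, the following is a minimal toy sketch of an imagination rollout, not the released implementation. Every function here (`embed_frame`, `transformer_step`, `decode`, `imagine`) is a hypothetical stand-in with made-up arithmetic in place of learned networks, and frames are short lists of integers rather than images. Only the data flow is meant to be representative: a continuous summary of the current frame and an action go into the sequence model, which emits a reward, a termination flag, and a few discrete delta tokens, and the decoder then builds the next frame from the previous frame, the action, and those tokens.

```python
import random

K = 4         # delta tokens per frame (assumed here; matches the count quoted for Figure 5)
VOCAB = 1024  # codebook size (assumed here; matches the count quoted for Figure 5)


def embed_frame(frame):
    """Hypothetical stand-in for the continuous frame summary fed to the transformer."""
    return sum(frame) / len(frame)


def transformer_step(frame_summary, action, rng):
    """Hypothetical stand-in for the autoregressive transformer.

    Returns a reward, a termination flag, and K discrete delta tokens.
    A real model would condition on frame_summary and action; this toy
    version just samples tokens so that the loop can be executed.
    """
    reward = 0.0
    done = False
    delta_tokens = [rng.randrange(VOCAB) for _ in range(K)]
    return reward, done, delta_tokens


def decode(prev_frame, action, delta_tokens):
    """Hypothetical stand-in for the decoder: next frame from the previous
    frame, the action, and the delta tokens."""
    shift = action + (sum(delta_tokens) % 3)
    return [(pixel + shift) % 256 for pixel in prev_frame]


def imagine(x0, actions, seed=0):
    """Unroll the toy world model for one step per action."""
    rng = random.Random(seed)
    frames, rewards, dones = [x0], [], []
    x = x0
    for a in actions:
        r, d, z = transformer_step(embed_frame(x), a, rng)
        x = decode(x, a, z)  # the frame is produced by the decoder, not the transformer
        frames.append(x)
        rewards.append(r)
        dones.append(d)
        if d:
            break
    return frames, rewards, dones


frames, rewards, dones = imagine([0, 1, 2, 3], actions=[1, 0, 2])
print(len(frames), len(rewards), len(dones))
```

The point of the sketch is the cost structure: the sequence model emits only `K` tokens per time step, while everything that can be inferred from the previous frame and the action is left to the decoder.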

Paper Structure (23 sections, 9 figures, 9 tables)

Figures (9)

  • Figure 1: Discrete autoencoder of IRIS (left) and $\Delta$-IRIS (right). IRIS encodes and decodes frames independently, meaning that $z_t$ has to carry all the information necessary to reconstruct $x_t$. In contrast, the encoder and decoder of $\Delta$-IRIS are conditioned on past frames and actions, so $z_t$ only has to capture what has changed and cannot be inferred from actions, i.e. the stochastic delta. This conditioning scheme drastically reduces the number of tokens required to encode a frame with minimal loss ($K \ll K_I$), which is critical to speed up the autoregressive transformer that predicts future tokens.
  • Figure 2: Unrolling dynamics over time. At each time step (separated by dashed lines), the GPT-like autoregressive transformer $G$ predicts the $\Delta$-tokens for the next frame, as well as the reward and a potential episode termination. Its input sequence consists of action tokens, $\Delta$-tokens, and I-tokens, namely continuous image embeddings that alleviate the need to attend to past $\Delta$-tokens for world modelling. More specifically, an initial frame $x_0$ is embedded into the I-token $\tilde{x}_0$. From $\tilde{x}_0$ and $a_0$, $G$ predicts the reward $\hat{r}_0$, the episode termination $\hat{d}_0 \in \{0, 1\}$, and, autoregressively, $\hat{z}_1 = (\hat{z}_1^1, \dots, \hat{z}_1^K)$, the $\Delta$-tokens for the next frame. Note that, during the imagination procedure, the next frame (striped box) is computed by the decoder $D$ based on previous frames, actions, and the $\Delta$-tokens generated by $G$, i.e. $x_1 = D(x_0, a_0, \hat{z}_1)$.
  • Figure 3: Evidence of dynamics disentanglement. Two trajectories are imagined with different ways of generating $\Delta$-tokens. In the top trajectory, $\Delta$-tokens are sampled randomly. In the bottom trajectory, the autoregressive transformer predicts future $\Delta$-tokens. The same starting frame ($t=0$) and sequence of actions are used. With random $\Delta$-tokens, the deterministic aspects of the dynamics (layout, movement, items, crafting) are still properly modelled, but the stochastic dynamics (mobs, health indicators) become problematic. For instance, the agent successfully cuts down a tree between $t=4$ and $t=5$, and uses wood planks to build a crafting table between $t=10$ and $t=12$. We observe that these dynamics are modelled in the same way whether $\Delta$-tokens are sampled randomly or not. However, in the top trajectory, large quantities of cows appear and disappear from the screen incoherently, whereas the bottom trajectory does not display such erratic patterns. This experiment shows that $\Delta$-IRIS encodes stochastic deltas between time steps with $\Delta$-tokens, and its decoder handles the deterministic aspects of world modelling. The appendix contains additional examples.
  • Figure 4: Returns at multiple frame budgets in the Crafter benchmark. $\Delta$-IRIS achieves higher returns than DreamerV3 beyond 3M frames, and surpasses IRIS for all frame budgets considered. Removing I-tokens from the input sequence of the autoregressive transformer significantly hurts performance.
  • Figure 5: Bottom $1\%$ test frames autoencoded by $\Delta$-IRIS (4 tokens) and IRIS (16 tokens). Each token takes a value in $\{1, 2, \dots, 1024\}$, i.e. $\Delta$-IRIS encodes frames with $4 \times \log_{2}(1024) = 40$ bits while IRIS uses $16 \times \log_{2}(1024) = 160$ bits. Original frames, reconstructions, and errors are respectively displayed in the top, middle, and bottom rows. Even in the worst instances, $\Delta$-IRIS makes only minor errors, whereas IRIS fails to accurately reconstruct frames. These errors severely hamper the agent's performance, as it purely learns behaviours from frames generated by its autoencoder.
  • ...and 4 more figures
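
The bit budgets quoted in the Figure 5 caption follow from simple arithmetic: a frame encoded with $n$ tokens drawn from a vocabulary of size $V$ carries at most $n \log_2 V$ bits. The short check below reproduces the two numbers from the caption. The helper name `bits_per_frame` is illustrative and not taken from the released code.

```python
import math


def bits_per_frame(num_tokens, vocab_size):
    """Upper bound on the information carried by num_tokens discrete tokens."""
    return num_tokens * math.log2(vocab_size)


delta_iris_bits = bits_per_frame(4, 1024)  # 4 tokens per frame
iris_bits = bits_per_frame(16, 1024)       # 16 tokens per frame

print(delta_iris_bits, iris_bits, iris_bits / delta_iris_bits)
# prints: 40.0 160.0 4.0
```

The fourfold reduction in tokens per frame is what shortens the sequences the autoregressive transformer has to process.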