Inter-environmental world modeling for continuous and compositional dynamics

Kohei Hayashi, Masanori Koyama, Julian Jorge Andrade Guerreiro

TL;DR

This work addresses the challenge of generalizing world models across diverse environments by learning an environment-agnostic simulator that embraces continuous and compositional actions. It introduces World modeling through Lie Action (WLA), which uses a Lie group to model transitions in a latent space that is partitioned into object-centric slots, enabling linearized, compositional dynamics that lift to the observation space via an equivariant autoencoder. By training an adaptive controller in the latent Lie-algebra space and coupling it with an inverse dynamic map, WLA solves the Controller Interface Problem in a structured, minimally-labeled setting and demonstrates strong cross-environment generalization on Phyre, ProcGen, and Android-like datasets. The approach yields superior temporal coherence and action-consistent generation compared to baselines, highlighting the practical impact of integrating Lie-group structure and object-centric representations for interactive, generative world models.

Abstract

Many current world model frameworks are autoregressive and rely on discrete representations of actions and observations, and they are succeeding in constructing interactive generative models for a target environment of interest. Humans, meanwhile, demonstrate a remarkable ability to generalize: they combine experiences from multiple environments to mentally simulate, and learn to control, agents in diverse environments. Inspired by this human capability, we introduce World modeling through Lie Action (WLA), an unsupervised framework that learns continuous latent action representations to simulate across environments. WLA learns a control interface with high controllability and predictive ability by simultaneously modeling the dynamics of multiple environments using Lie group theory and an object-centric autoencoder. On synthetic benchmarks and real-world datasets, we demonstrate that WLA can be trained using only video frames and, with minimal or no action labels, can quickly adapt to new environments with novel action sets.

Paper Structure

This paper contains 38 sections, 23 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: The relation between observation space dynamics and latent space dynamics with Lie group action. We elaborate on the design of $\Phi, \Psi$ and the implementation of $\mathcal{F}_{\Phi,\Psi}$ in Section \ref{sec:implementation}.
  • Figure 2: WLA is based on a slot-attention-based autoencoder. The latent space is partitioned into slots, and the transition for each slot occurs by a linear Lie group action. (a) During the training of WLA, the inverse dynamic map $\mathcal{F}_{\Phi,\Psi}$ converts the transition to a Lie algebra operator. (b) The forward rollout simulation with a controller interface is implemented by mapping the contextual inputs (past observations and external action signals) to Lie algebra parameters $(\lambda,\theta)$ and by applying the resulting operators $M$ to the slot token $z[t]$ to create the future observations autoregressively.
  • Figure 3: Phyre Interpolation. Top: Training frames at a low sampling rate (1 FPS). Bottom: Reconstructed trajectory at a high sampling rate (8 FPS), with interpolated frames shown in blue.
  • Figure 4: Composition results on Phyre. Left: Applying the sum of the actions of the red and blue balls to the blue ball. Since the red ball is climbing, its action counteracts the falling action of the blue ball. Right: Applying the sum of actions to the red ball, showing similar compositional effects.
  • Figure 5: Control results obtained through $\mathrm{Ctrl}_{\mathrm{adapt}}$ on ProcGen. The figure shows 6 of the 16 frames produced by applying the action sequence written below the rendered frames.
  • ...and 1 more figure
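
The captions for Figures 2 to 4 describe a rollout in which Lie algebra parameters $(\lambda,\theta)$ are turned into operators $M$ that act linearly on a slot token $z[t]$. The sketch below illustrates that idea under assumptions that are not stated in the text above: each slot is split into 2x2 blocks, and each block's operator is a scaling $e^{\lambda}$ times a rotation by $\theta$. The function names (`block_operator`, `slot_operator`, `rollout`) are hypothetical, and the encoder, decoder, and controller are omitted entirely.

```python
import numpy as np


def block_operator(lam, theta):
    """exp(lam * I + theta * J) for a 2x2 block: scaling e^lam times rotation by theta.

    The scaling-rotation form is an assumption made for illustration.
    """
    c, s = np.cos(theta), np.sin(theta)
    return np.exp(lam) * np.array([[c, -s], [s, c]])


def slot_operator(lams, thetas):
    """Block-diagonal operator M for one slot, built from per-block (lam, theta)."""
    dim = 2 * len(lams)
    M = np.zeros((dim, dim))
    for i, (lam, theta) in enumerate(zip(lams, thetas)):
        M[2 * i:2 * i + 2, 2 * i:2 * i + 2] = block_operator(lam, theta)
    return M


def rollout(z0, lams, thetas, steps):
    """Autoregressive latent rollout z[t+1] = M z[t] with a fixed operator M."""
    M = slot_operator(lams, thetas)
    zs = [np.asarray(z0, dtype=float)]
    for _ in range(steps):
        zs.append(M @ zs[-1])
    return np.stack(zs)


if __name__ == "__main__":
    lams = np.array([0.1, -0.2])
    thetas = np.array([0.3, 1.0])
    M = slot_operator(lams, thetas)

    # Interpolation: eight steps with parameters scaled by 1/8 equal one full step.
    M_fine = slot_operator(lams / 8, thetas / 8)
    print(np.allclose(np.linalg.matrix_power(M_fine, 8), M))

    # Composition: summing Lie algebra parameters multiplies the operators.
    lams_b, thetas_b = np.array([-0.1, 0.05]), np.array([-0.3, 0.2])
    M_b = slot_operator(lams_b, thetas_b)
    print(np.allclose(slot_operator(lams + lams_b, thetas + thetas_b), M @ M_b))

    print(rollout(np.ones(4), lams, thetas, steps=3).shape)
```

In this toy parameterization the operators commute, so adding parameters in the Lie algebra composes the corresponding actions, and dividing them by a constant gives intermediate time steps. Whether WLA uses exactly this block structure is not established by the captions above.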

Theorems & Definitions (1)

  • Definition 1