Inter-environmental world modeling for continuous and compositional dynamics
Kohei Hayashi, Masanori Koyama, Julian Jorge Andrade Guerreiro
TL;DR
This work addresses the challenge of generalizing world models across diverse environments by learning an environment-agnostic simulator that supports continuous and compositional actions. It introduces World modeling through Lie Action (WLA), which uses a Lie group to model transitions in a latent space that is partitioned into object-centric slots, enabling linearized, compositional dynamics that lift to the observation space via an equivariant autoencoder. By training an adaptive controller in the latent Lie-algebra space and coupling it with an inverse dynamics map, WLA solves the Controller Interface Problem in a structured, minimally labeled setting and demonstrates strong cross-environment generalization on Phyre, ProcGen, and Android-like datasets. The approach yields superior temporal coherence and action-consistent generation compared to baselines, highlighting the practical impact of integrating Lie-group structure and object-centric representations for interactive, generative world models.
Abstract
Many of today's world model frameworks are autoregressive and rely on discrete representations of actions and observations, and they have succeeded in constructing interactive generative models for the target environment of interest. Humans, by contrast, demonstrate a remarkable ability to generalize: they combine experiences from multiple environments to mentally simulate, and learn to control, agents in diverse environments. Inspired by this human capability, we introduce World modeling through Lie Action (WLA), an unsupervised framework that learns continuous latent action representations to simulate across environments. WLA learns a control interface with high controllability and predictive ability by simultaneously modeling the dynamics of multiple environments using Lie group theory and an object-centric autoencoder. On synthetic benchmarks and real-world datasets, we demonstrate that WLA can be trained using only video frames and, with minimal or no action labels, can quickly adapt to new environments with novel action sets.
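
As a rough illustration of what a Lie-group action on a slot-partitioned latent space could look like, the sketch below rotates each slot independently. It is a toy under stated assumptions and not the WLA implementation: the choice of SO(2) as the group, the 2-dimensional slots, and the names `rotation_block`, `step`, and `rollout` are all hypothetical, and the encoder, decoder, and controller are omitted entirely.

```python
import numpy as np


def rotation_block(theta):
    """Exponential of the so(2) element theta * [[0, -1], [1, 0]] (a 2x2 rotation)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])


def step(z, thetas):
    """Apply one latent transition.

    z:      array of shape (num_slots, 2), one 2-D latent per slot (assumed layout).
    thetas: array of shape (num_slots,), one Lie-algebra coordinate per slot.
    Each slot is transformed independently, i.e. the full transition matrix
    is block-diagonal.
    """
    return np.stack([rotation_block(t) @ z_k for t, z_k in zip(thetas, z)])


def rollout(z0, actions):
    """Unroll a sequence of latent actions starting from z0."""
    zs = [z0]
    for a in actions:
        zs.append(step(zs[-1], a))
    return np.stack(zs)


# Two slots, two consecutive actions (arbitrary illustrative values).
z0 = np.array([[1.0, 0.0], [0.0, 2.0]])
a1 = np.array([0.3, -0.1])
a2 = np.array([0.2, 0.5])

sequential = step(step(z0, a1), a2)   # apply a1, then a2
composed = step(z0, a1 + a2)          # apply the summed Lie-algebra action once
trajectory = rollout(z0, [a1, a2])    # shape (3, num_slots, 2)
```

In this toy setting, applying two actions in sequence matches applying their sum in the Lie algebra, because rotations within a single SO(2) block commute. That is the simplest instance of the continuous, compositional behaviour described above; non-commutative groups would not compose by simple addition.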
