GEM: A Generalizable Ego-Vision Multimodal World Model for Fine-Grained Ego-Motion, Object Dynamics, and Scene Composition Control

Mariam Hassan; Sebastian Stapf; Ahmad Rahimi; Pedro M B Rezende; Yasaman Haghighi; David Brüggemann; Isinsu Katircioglu; Lin Zhang; Xiaoran Chen; Suman Saha; Marco Cannici; Elie Aljalbout; Botao Ye; Xi Wang; Aram Davtyan; Mathieu Salzmann; Davide Scaramuzza; Marc Pollefeys; Paolo Favaro; Alexandre Alahi

TL;DR

GEM addresses the need for a controllable, multimodal ego-vision world model capable of long-horizon prediction. It introduces three control channels (ego-motion via ego-trajectories, object-level scene composition via DINOv2 tokens, and human pose via pose maps) along with depth generation, all trained on a large, diverse, pseudo-labeled dataset. The model uses a diffusion backbone with autoregressive, staged training to produce temporally coherent long sequences, and it introduces a new Control of Object Manipulation (COM) metric to quantify object-manipulation controllability. Experiments show improved long-horizon quality and strong controllability across autonomous driving, egocentric activity, and drone domains. Code, models, and datasets are open-sourced. This work lays a foundation for adaptable, controllable world models in multimodal ego-vision applications.

Abstract

We present GEM, a Generalizable Ego-vision Multimodal world model that predicts future frames using a reference frame, sparse features, human poses, and ego-trajectories. These conditioning signals give the model precise control over object dynamics, ego-agent motion, and human poses. GEM generates paired RGB and depth outputs for richer spatial understanding. We introduce autoregressive noise schedules to enable stable long-horizon generations. Our dataset comprises more than 4000 hours of multimodal data across domains such as autonomous driving, egocentric human activities, and drone flights. Depth maps, ego-trajectories, and human poses are obtained through pseudo-labeling. We use a comprehensive evaluation framework, including a new Control of Object Manipulation (COM) metric, to assess controllability. Experiments show that GEM excels at generating diverse, controllable scenarios and maintains temporal consistency over long generations. Code, models, and datasets are fully open-sourced.
