GEM: A Generalizable Ego-Vision Multimodal World Model for Fine-Grained Ego-Motion, Object Dynamics, and Scene Composition Control

Mariam Hassan, Sebastian Stapf, Ahmad Rahimi, Pedro M B Rezende, Yasaman Haghighi, David Brüggemann, Isinsu Katircioglu, Lin Zhang, Xiaoran Chen, Suman Saha, Marco Cannici, Elie Aljalbout, Botao Ye, Xi Wang, Aram Davtyan, Mathieu Salzmann, Davide Scaramuzza, Marc Pollefeys, Paolo Favaro, Alexandre Alahi

TL;DR

GEM addresses the need for a controllable, multimodal ego-vision world model capable of long-horizon prediction. It introduces three control channels (ego-motion via ego-trajectories, object-level scene composition via DINOv2 tokens, and human pose via pose maps) along with depth generation, all trained on a large, diverse, pseudo-labeled dataset. The model uses a diffusion backbone with autoregressive, staged training to produce temporally coherent long sequences, and the paper introduces a new Control of Object Manipulation (COM) metric to quantify object-manipulation controllability. Experimental results show improved long-horizon quality and strong controllability across autonomous driving, egocentric activities, and drone domains; code, models, and datasets are open-sourced. This work lays a foundation for adaptable, controllable world models in multimodal ego-vision applications.

Abstract

We present GEM, a Generalizable Ego-vision Multimodal world model that predicts future frames using a reference frame, sparse features, human poses, and ego-trajectories. Hence, our model has precise control over object dynamics, ego-agent motion, and human poses. GEM generates paired RGB and depth outputs for richer spatial understanding. We introduce autoregressive noise schedules to enable stable long-horizon generations. Our dataset comprises 4000+ hours of multimodal data across domains such as autonomous driving, egocentric human activities, and drone flights. Pseudo-labels are used to obtain depth maps, ego-trajectories, and human poses. We use a comprehensive evaluation framework, including a new Control of Object Manipulation (COM) metric, to assess controllability. Experiments show that GEM excels at generating diverse, controllable scenarios and at maintaining temporal consistency over long generations. Code, models, and datasets are fully open-sourced.

Paper Structure

This paper contains 31 sections, 13 figures, 8 tables, and 1 algorithm.

Figures (13)

  • Figure 1: GEM generates two modalities by taking as inputs a reference frame and noisy latents of the image and depth modalities. The denoiser network $D_{\theta}$ is conditioned on ego-trajectories, DINOv2 features, and human poses. Ego-trajectories are added using a cross-attention LoRA at every block of the network. DINOv2 features and human poses are added to the output of each block in the input layers of the denoiser. To handle multimodal outputs, we use different convolution-based output projection layers $P$.
  • Figure 2: During training, the sparse DINOv2 features from frame $t_i$ are translated to frame $\tau_i$ using the corresponding optical flow.
  • Figure 3: Visualization of our dynamic autoregressive sampling noise schedule for denoising 6 frames in total with a window size of 3 frames and 3 sampling steps.
  • Figure 4: FVD and FID comparison for the long generations of GEM and Vista.
  • Figure 5: Qualitative results for GEM's controllability. GEM can flexibly move objects (top-left), insert new objects (top-right), change ego trajectories (bottom-left), and change human poses (bottom-right). Refer to our project page, https://vita-epfl.github.io/GEM.github.io/, for more videos.
  • ...and 8 more figures
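
The feature translation described in the Figure 2 caption can be illustrated with a small sketch. This is not the authors' implementation: the function name `translate_sparse_tokens`, the nested-list flow layout, the nearest-cell rounding, and the choice to drop tokens that leave the grid are all assumptions made for illustration. The idea shown is only that each sparse token position in frame $t_i$ is shifted by the optical-flow vector sampled at that position to obtain its position in frame $\tau_i$.

```python
from typing import Dict, List, Tuple

# Assumed layout: flow[y][x] = (dx, dy), the displacement in grid cells
# from the source frame t_i to the target frame tau_i.
Flow = List[List[Tuple[float, float]]]


def translate_sparse_tokens(
    positions: List[Tuple[int, int]],
    flow: Flow,
) -> Dict[int, Tuple[int, int]]:
    """Shift sparse token positions (x, y) by the flow sampled at each position.

    Returns a mapping from token index to its new (x, y) position.
    Tokens that land outside the grid are omitted (an assumption of this sketch).
    """
    height = len(flow)
    width = len(flow[0])
    moved: Dict[int, Tuple[int, int]] = {}
    for index, (x, y) in enumerate(positions):
        dx, dy = flow[y][x]
        # Snap the displaced position to the nearest grid cell.
        new_x, new_y = round(x + dx), round(y + dy)
        if 0 <= new_x < width and 0 <= new_y < height:
            moved[index] = (new_x, new_y)
    return moved


# Hypothetical example: a 3x4 grid where everything moves one cell to the right.
flow = [[(1.0, 0.0) for _ in range(4)] for _ in range(3)]
tokens = [(0, 0), (2, 1), (3, 2)]
moved = translate_sparse_tokens(tokens, flow)
```

In this example the third token is pushed past the right edge of the grid and is dropped, while the first two keep their indices so that their feature vectors can still be looked up after the move.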