MUVO: A Multimodal Generative World Model for Autonomous Driving with Geometric Representations
Daniel Bogdoll, Yitian Yang, Tim Joseph, Melih Yazgan, J. Marius Zöllner
TL;DR
MUVO is a self-supervised, BEV-free multimodal world model for autonomous driving that fuses camera and lidar data using geometric voxel representations. It uses transformer-based fusion and a 2D latent space to predict future observations and 3D occupancy without relying on BEV features, and it shows how sensor fusion design choices and occupancy pretraining affect prediction quality. The study compares fusion strategies and latent-space configurations in detail, showing that lossless range-view lidar encodings and 2D latent spaces yield strong camera and occupancy performance, while BEV mappings can hinder results. The findings offer practical guidance for building scalable, action-conditioned world models with multimodal sensing, and a public codebase is provided to enable further research.
Abstract
World models for autonomous driving have the potential to dramatically improve the reasoning capabilities of today's systems. However, most works focus on camera data, and only a few leverage lidar data or combine both modalities to better represent autonomous vehicle sensor setups. In addition, raw sensor predictions are less actionable than 3D occupancy predictions, yet no prior work examines the effects of combining multimodal sensor data with 3D occupancy prediction. In this work, we perform a set of experiments with a MUltimodal World Model with Geometric VOxel representations (MUVO), evaluating different sensor fusion strategies to better understand their effects on sensor data prediction. We also analyze potential weaknesses of current sensor fusion approaches and examine the benefits of additionally predicting 3D occupancy.
