MUVO: A Multimodal Generative World Model for Autonomous Driving with Geometric Representations

Daniel Bogdoll, Yitian Yang, Tim Joseph, Melih Yazgan, J. Marius Zöllner

TL;DR

MUVO presents a self-supervised, BEV-free multimodal world model for autonomous driving that fuses camera and lidar data using geometric voxel representations. It leverages a transformer-based fusion and a 2D latent space to predict future observations and 3D occupancy without relying on BEV features, and it demonstrates how sensor fusion design choices and occupancy pretraining impact prediction quality. The study provides detailed comparisons across fusion strategies and latent-space configurations, showing that lossless range-view lidar encodings and 2D latent spaces yield strong camera and occupancy performance, while BEV mappings can hinder results. The findings offer practical guidance for building scalable, action-conditioned world models with multimodal sensing, and the work provides a public codebase to enable further research.

Abstract

World models for autonomous driving have the potential to dramatically improve the reasoning capabilities of today's systems. However, most works focus on camera data, with only a few that leverage lidar data or combine both to better represent autonomous vehicle sensor setups. In addition, raw sensor predictions are less actionable than 3D occupancy predictions, but there are no works examining the effects of combining both multimodal sensor data and 3D occupancy prediction. In this work, we perform a set of experiments with a MUltimodal World Model with Geometric VOxel representations (MUVO) to evaluate different sensor fusion strategies to better understand the effects on sensor data prediction. We also analyze potential weaknesses of current sensor fusion approaches and examine the benefits of additionally predicting 3D occupancy.

Paper Structure (12 sections, 1 equation, 6 figures)

Figures (6)

  • Figure 1: Qualitative output of a sensor fusion experiment with occupancy prediction activated. The predictions shown for camera and lidar sensors and 3D occupancy are based on past camera and lidar inputs.
  • Figure 2: MUVO Overview: Raw camera images and lidar point clouds are processed and fused. The resulting latent representations are fed into our transition model. Conditioned on actions, future states are predicted. Finally, future states are decoded into 3D occupancy grids, raw point clouds, and raw images.
  • Figure 3: Sensor Fusion: With $\mathcal{D}_{val}^{RL}$ we evaluate representation learning, while $\mathcal{D}_{val}^{DS}$ examines robustness. We examined feature averaging (AVG) [chenInterpretableEndtoEndUrban2022], feature concatenation (FC) [wu2022daydreamer], and a transformer-based architecture (TR) [chittaTransFuserImitationTransformerBased2022]. For lidar encodings, we evaluated PointPillars (PP) [langPointPillarsFastEncoders2019] and a range view (RR) [liVehicleDetection3D2016] representation followed by a ResNet [heDeepResidualLearning2016]. For camera, we evaluated direct encoding without BEV (WOB) and a BEV mapping [philionLiftSplatShoot2020].
  • Figure 4: Two-dimensional Latent Space: We compare a 1D baseline to a set of 2D latent spaces, where we also examine the influence of a vision-transformer backbone and an additional perceptual loss term (PL). For the backbone, we evaluate ResNet18 (RN) and MobileViT-V2 (VIT).
  • Figure 5: Pre-Training: Influence of camera-lidar pre-training for 50,000 steps on 3D occupancy prediction. We evaluate on both $\mathcal{D}_{val}^{RL}$ and $\mathcal{D}_{val}^{DS}$. The green lines show our benchmark without pre-training. Violet lines show the pre-trained model with frozen weights, and blue lines show the pre-trained model with weights left trainable.
  • ...and 1 more figures
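
The range-view lidar encoding mentioned in the Figure 3 caption can be illustrated with a minimal sketch. It projects each lidar point onto a 2D image indexed by elevation (rows) and azimuth (columns) and stores the measured range per pixel. The function name `range_view`, the image size, and the field-of-view values are assumptions made for illustration only; they are not taken from the paper or its codebase, and a real encoding would match the image resolution to the sensor's beam layout.

```python
import math


def range_view(points, height=4, width=8, fov_up_deg=15.0, fov_down_deg=-25.0):
    """Project (x, y, z) lidar points onto a height x width range image.

    Illustrative sketch only: sizes and field of view are hypothetical.
    Empty pixels hold 0.0; each filled pixel holds the closest range seen.
    """
    fov_up = math.radians(fov_up_deg)
    fov_down = math.radians(fov_down_deg)
    image = [[0.0] * width for _ in range(height)]

    for x, y, z in points:
        r = math.sqrt(x * x + y * y + z * z)
        if r == 0.0:
            continue  # a point at the sensor origin has no direction

        azimuth = math.atan2(y, x)    # horizontal angle in [-pi, pi]
        elevation = math.asin(z / r)  # vertical angle in [-pi/2, pi/2]
        if not (fov_down <= elevation <= fov_up):
            continue  # outside the assumed vertical field of view

        # Map angles to pixel indices; columns wrap around the full circle.
        col = int((azimuth + math.pi) / (2.0 * math.pi) * width) % width
        row = int((fov_up - elevation) / (fov_up - fov_down) * height)
        row = min(row, height - 1)

        # When several points fall into one pixel, keep the nearest return.
        if image[row][col] == 0.0 or r < image[row][col]:
            image[row][col] = r

    return image


if __name__ == "__main__":
    cloud = [(10.0, 1.0, 0.0), (5.0, 0.5, 0.0), (1.0, -5.0, 0.0), (0.1, 0.0, 10.0)]
    for image_row in range_view(cloud):
        print(["%.2f" % value for value in image_row])
```

The resulting 2D array can be passed to an image backbone, which is the role the caption assigns to the ResNet that follows the range-view representation. With a coarse grid like the one above, several points collapse into one pixel, so this sketch is lossy; the projection only preserves every return when the image resolution matches the sensor's angular sampling.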