MaskGWM: A Generalizable Driving World Model with Video Mask Reconstruction

Jingcheng Ni, Yuxin Guo, Yichen Liu, Rui Chen, Lewei Lu, Zehuan Wu

Abstract

World models that forecast environmental changes from actions are vital for autonomous driving models with strong generalization. Prevailing driving world models are mainly built on video prediction models. Although these models can produce high-fidelity video sequences with advanced diffusion-based generators, they are constrained by their prediction duration and overall generalization capability. In this paper, we address this problem by combining the generation loss with MAE-style feature-level context learning. In particular, we instantiate this target with three key designs: (1) a more scalable Diffusion Transformer (DiT) structure trained with an extra mask reconstruction task; (2) diffusion-related mask tokens that handle the fuzzy relation between mask reconstruction and the generative diffusion process; and (3) an extension of the mask reconstruction task to the spatial-temporal domain, using a row-wise mask for shifted self-attention rather than the masked self-attention of MAE. We then adopt a row-wise cross-view module to align with this mask design. Based on these improvements, we propose MaskGWM: a Generalizable driving World Model embodied with Video Mask reconstruction. Our model has two variants: MaskGWM-long, which focuses on long-horizon prediction, and MaskGWM-mview, which is dedicated to multi-view generation. Comprehensive experiments on standard benchmarks validate the effectiveness of the proposed method, covering standard validation on the nuScenes dataset, long-horizon rollout on the OpenDV-2K dataset, and zero-shot validation on the Waymo dataset. Quantitative metrics on these datasets show that our method notably improves on state-of-the-art driving world models.

Paper Structure

This paper contains 31 sections, 5 equations, 12 figures, 6 tables.

Figures (12)

  • Figure 1: (a) MaskGWM improves fidelity and generalization through a web-scale dataset, a scalable DiT architecture, and a Mask Reconstruction (MR) target. (b) The proposed MR applies a two-branch structure for spatial context (scene objects) and temporal context (object motions).
  • Figure 2: Our model facilitates zero-shot generation, consistent long-horizon prediction and multi-view video generation.
  • Figure 3: Overview of MaskGWM. We propose mask reconstruction, consisting of token mask and token reconstruction, as a complementary task for training the driving world model. Token Mask: we randomly sample tokens with a temporal-shared $\mathcal{M}_{spatial}$ and a temporal-unshared $\mathcal{M}_{time}$, specialized for spatial and temporal modeling respectively. Token Reconstruction: we fill invisible tokens with diffusion-related mask tokens (Sec. 3.2) and recover features with a two-branch transformer. Moreover, we introduce a row-wise mask strategy (Sec. 3.3) for the temporal branch. $\rho=1-r$ is used for simplicity in the encoder.
  • Figure 4: Comparison of different mask types and attention operations for the temporal transformer block with the mask reconstruction task. The attention mask is applied only when $\mathcal{M} = \mathcal{M}_{time}$.
  • Figure 5: Long-horizon prediction results of MaskGWM. Our model is capable of forecasting long video sequences with stability, devoid of collapse or blurring issues.
  • ...and 7 more figures
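
The token-mask step described in the Figure 3 and Figure 4 captions can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that $r$ is the fraction of masked tokens (so $\rho = 1 - r$ is the visible fraction), that "temporal-shared" means one spatial pattern reused for every frame, and that "row-wise" means whole token rows are hidden independently per frame. The function names, the `(T, H, W)` token grid, and the use of NumPy are all hypothetical choices made for illustration.

```python
import numpy as np

def sample_spatial_mask(T, H, W, r, rng):
    """Assumed form of M_spatial: one random pattern shared by all T frames.

    True marks a masked (invisible) token; r is the assumed mask ratio.
    """
    n_tokens = H * W
    n_masked = int(round(r * n_tokens))
    flat = np.zeros(n_tokens, dtype=bool)
    flat[rng.choice(n_tokens, size=n_masked, replace=False)] = True
    # Repeat the same spatial pattern along the time axis.
    return np.broadcast_to(flat.reshape(1, H, W), (T, H, W)).copy()

def sample_time_mask_rowwise(T, H, W, r, rng):
    """Assumed form of a row-wise M_time: each frame hides its own set of rows."""
    n_masked_rows = int(round(r * H))
    mask = np.zeros((T, H, W), dtype=bool)
    for t in range(T):
        rows = rng.choice(H, size=n_masked_rows, replace=False)
        mask[t, rows, :] = True  # hide every token in the chosen rows
    return mask

rng = np.random.default_rng(0)
m_spatial = sample_spatial_mask(T=4, H=8, W=16, r=0.25, rng=rng)
m_time = sample_time_mask_rowwise(T=4, H=8, W=16, r=0.25, rng=rng)
visible_ratio = 1.0 - m_spatial.mean()  # rho = 1 - r under the assumption above
```

Under these assumptions, every frame of a row-wise mask keeps the same number of fully visible rows, so the visible tokens of each frame still form a regular grid of complete rows.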