D$^2$-World: An Efficient World Model through Decoupled Dynamic Flow

Haiming Zhang, Xu Yan, Ying Xue, Zixuan Guo, Shuguang Cui, Zhen Li, Bingbing Liu

TL;DR

D$^2$-World is a world model that forecasts future point clouds through decoupled dynamic flow. It achieves state-of-the-art performance on the OpenScene Predictive World Model benchmark, placing second in the challenge, and trains more than 300% faster than the baseline model.

Abstract

This technical report summarizes the second-place solution for the Predictive World Model Challenge held at the CVPR-2024 Workshop on Foundation Models for Autonomous Systems. We introduce D$^2$-World, a novel world model that forecasts future point clouds through decoupled dynamic flow. Specifically, past semantic occupancies are first obtained from an existing occupancy network (e.g., BEVDet). These occupancy results then serve as the input to a single-stage world model, which generates future occupancy in a non-autoregressive manner. To further simplify the task, the world model decouples dynamic voxels from static ones: it generates future dynamic voxels by warping the existing observations through voxel flow, while the remaining static voxels are obtained directly through pose transformation. As a result, our approach achieves state-of-the-art performance on the OpenScene Predictive World Model benchmark, securing second place, and trains more than 300% faster than the baseline model. Code is available at https://github.com/zhanghm1995/D2-World.
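To make the decoupling idea concrete, the following is a minimal sketch, not the released implementation. It assumes a sparse occupancy grid stored as a dictionary, planar ego motion, per-voxel flow given in voxel units in the current ego frame, and forward nearest-voxel warping. The label ids, `ego_transform`, and `forecast_occupancy` are hypothetical names introduced only for illustration.

```python
import math

# Hypothetical label ids; the report does not define a label map here.
ROAD, CAR = 1, 2
DYNAMIC_CLASSES = {CAR}


def ego_transform(point, yaw, translation):
    """Map a point from the current ego frame into the future ego frame.

    Assumption: the future ego pose is the current pose shifted by
    `translation` = (tx, ty) and rotated by `yaw`, both expressed in the
    current frame, so the point undergoes the inverse rigid transform.
    """
    x, y, z = point
    tx, ty = translation
    dx, dy = x - tx, y - ty
    c, s = math.cos(yaw), math.sin(yaw)
    # Rotation by -yaw applied to the translated point.
    return (c * dx + s * dy, -s * dx + c * dy, z)


def forecast_occupancy(occupancy, flow, yaw, translation):
    """Forecast one future frame from sparse occupancy {(x, y, z): label}.

    Static voxels are moved by the ego-pose transform alone. Dynamic voxels
    are first displaced by their own flow vector and then moved by the same
    ego-pose transform.
    """
    future = {}
    for voxel, label in occupancy.items():
        if label in DYNAMIC_CLASSES:
            fx, fy, fz = flow.get(voxel, (0.0, 0.0, 0.0))
            moved = (voxel[0] + fx, voxel[1] + fy, voxel[2] + fz)
        else:
            moved = voxel
        x, y, z = ego_transform(moved, yaw, translation)
        target = (round(x), round(y), round(z))
        # On collisions, let dynamic voxels overwrite static ones.
        if target not in future or label in DYNAMIC_CLASSES:
            future[target] = label
    return future


# Ego drives one voxel forward; the car ahead moves three voxels forward.
occupancy = {(0, 0, 0): ROAD, (1, 0, 0): ROAD, (2, 0, 0): CAR}
flow = {(2, 0, 0): (3.0, 0.0, 0.0)}
future = forecast_occupancy(occupancy, flow, yaw=0.0, translation=(1.0, 0.0))
```

In this toy scene the road voxels shift backward by one cell purely because of ego motion, while the car ends up two cells further ahead of the ego vehicle than before. A learned model would more plausibly predict dense flow and warp features by backward sampling followed by refinement, so the forward rounding used here is only a simplification.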

Paper Structure

This paper contains 8 sections, 2 figures, 3 tables.

Figures (2)

  • Figure 1: The overall pipeline of D$^2$-World. In the first stage, we train a single-frame occupancy network, and in the second stage, we train a world model that takes past occupancy as input, forecasting future point clouds.
  • Figure 2: Inner structure of SALT and of the warping and refinement modules. (a) The detailed structure of SALT, which replaces the MLP and FFN (Feed Forward Network) of the vanilla transformer with 2D and 3D convolutions, respectively, to capture spatial-temporal dependencies. (b) We decouple the flow into dynamic and static flow and warp the features of the current frame to forecast the future frame. The refinement module then refines the coarse warped features.