DiST-4D: Disentangled Spatiotemporal Diffusion with Metric Depth for 4D Driving Scene Generation

Jiazhe Guo, Yikang Ding, Xiwu Chen, Shuo Chen, Bohan Li, Yingshuang Zou, Xiaoyang Lyu, Feiyang Tan, Xiaojuan Qi, Zhiheng Li, Hao Zhao

TL;DR

DiST-4D tackles the problem of generating dynamic 4D driving scenes with temporal extrapolation and spatial NVS without per-scene optimization. It introduces metric depth as a core geometric representation and deploys a disentangled dual-diffusion framework (DiST-T for temporal RGB-D generation and DiST-S for spatial NVS), complemented by a metric-depth curation pipeline and a self-supervised cycle consistency strategy. The approach demonstrates state-of-the-art performance on temporal generation and novel-view synthesis on nuScenes, with competitive downstream planning results. The combination of a scalable, feed-forward design and depth-based geometry offers practical potential for autonomous driving data generation and simulation.

Abstract

Current generative models struggle to synthesize dynamic 4D driving scenes that simultaneously support temporal extrapolation and spatial novel view synthesis (NVS) without per-scene optimization. A key challenge lies in finding an efficient and generalizable geometric representation that seamlessly connects temporal and spatial synthesis. To address this, we propose DiST-4D, the first disentangled spatiotemporal diffusion framework for 4D driving scene generation, which leverages metric depth as the core geometric representation. DiST-4D decomposes the problem into two diffusion processes: DiST-T, which predicts future metric depth and multi-view RGB sequences directly from past observations, and DiST-S, which enables spatial NVS by training only on existing viewpoints while enforcing cycle consistency. This cycle consistency mechanism introduces a forward-backward rendering constraint, reducing the generalization gap between observed and unseen viewpoints. Metric depth is essential for both reliable forecasting and accurate spatial NVS, as it provides a view-consistent geometric representation that generalizes well to unseen perspectives. Experiments demonstrate that DiST-4D achieves state-of-the-art performance in both temporal prediction and NVS tasks, while also delivering competitive performance in planning-related evaluations.
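
To make the role of metric depth concrete, the following is a minimal sketch of how a depth map could be forward-warped into a shifted camera to obtain a sparse condition for a novel view. It is an illustration under stated assumptions, not the authors' implementation: the pinhole intrinsics tuple, the pure-translation camera motion, and the function names `unproject`, `project`, and `warp_depth` are all hypothetical.

```python
from typing import List, Optional, Tuple

# Hypothetical pinhole intrinsics: (fx, fy, cx, cy). Not taken from the paper.
Intrinsics = Tuple[float, float, float, float]
DepthMap = List[List[Optional[float]]]


def unproject(u: float, v: float, depth: float, K: Intrinsics) -> Tuple[float, float, float]:
    """Lift pixel (u, v) with metric depth to a 3D point in the camera frame."""
    fx, fy, cx, cy = K
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


def project(point: Tuple[float, float, float], K: Intrinsics) -> Optional[Tuple[float, float]]:
    """Project a 3D camera-frame point to pixel coordinates; None if behind the camera."""
    x, y, z = point
    if z <= 0:
        return None
    fx, fy, cx, cy = K
    return (fx * x / z + cx, fy * y / z + cy)


def warp_depth(depth_map: DepthMap, K: Intrinsics,
               translation: Tuple[float, float, float]) -> DepthMap:
    """Forward-warp a metric depth map into a camera shifted by `translation`.

    Assumption: the novel camera differs from the source by a pure translation
    (no rotation), so a source-frame point X becomes X - t in the novel frame.
    Pixels that receive no point stay None, which is what makes the result sparse.
    """
    h, w = len(depth_map), len(depth_map[0])
    warped: DepthMap = [[None] * w for _ in range(h)]
    tx, ty, tz = translation
    for v in range(h):
        for u in range(w):
            d = depth_map[v][u]
            if d is None or d <= 0:
                continue
            x, y, z = unproject(u, v, d, K)
            p = (x - tx, y - ty, z - tz)
            uv = project(p, K)
            if uv is None:
                continue
            un, vn = round(uv[0]), round(uv[1])
            if 0 <= un < w and 0 <= vn < h:
                # Z-buffer: keep the nearest surface when several points collide.
                if warped[vn][un] is None or p[2] < warped[vn][un]:
                    warped[vn][un] = p[2]
    return warped
```

Because the depth is metric, the same translation in meters applies to every pixel without a per-scene scale factor. The holes left in the warped map are the regions that a generative model would have to fill in.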

Paper Structure

This paper contains 35 sections, 3 equations, 17 figures, 8 tables.

Figures (17)

  • Figure 1: Overall framework of the proposed DiST-4D. DiST-4D is a disentangled spatiotemporal diffusion framework for 4D driving scene generation, leveraging metric depth as the core geometric representation to enable both temporal extrapolation and spatial novel view synthesis (NVS). (Top: Temporal Generation) DiST-T employs a diffusion model to predict future multi-camera RGB-D sequences from historical multi-camera images and control signals. The generated RGB-D sequences are then aggregated into point clouds, allowing for bullet time rendering. (Bottom: Spatial Generation) To enable spatial NVS, DiST-S leverages the predicted RGB-D sequences to generate novel viewpoints by first projecting them into sparse conditions and then refining them into dense RGB-D outputs.
  • Figure 2: Illustration of the metric depth curation pipeline. First, a multi-view stereo network processes multi-camera videos to produce a static scene point cloud. Simultaneously, sparse LiDAR point clouds are collected and fused with the MVS output to obtain an aggregated metric depth prompt. This intermediate depth representation serves as input to a generative depth completion network, which refines and densifies the depth estimates, producing high-fidelity dense metric depth maps.
  • Figure 3: Illustration of the diffusion transformer of DiST-T. STDiT captures multi-camera and multi-frame dependencies for future scene generation. It processes latent features with temporal and spatial STDiT blocks, integrating ego trajectory ($\mathbf{A}$), 3D object bounding boxes ($\mathbf{B}$), camera poses ($\mathbf{P}$) and map sequences ($\mathbf{M}$) as control signals.
  • Figure 4: (Top) In the first stage, DiST-S is trained on the original trajectory using $n$-frame projection. (Bottom) In the second stage, self-supervised cycle consistency (SCC) is introduced: novel trajectories are generated, and DiST-S learns to project between original and novel viewpoints.
  • Figure 5: Visualization results of generated surround depth. Compared to M$^2$Depth (M2D) [zou2024m2depth] and SurroundDepth (SD) [wei2023surrounddepth], DiST-T produces more fine-grained depth with enhanced details.
  • ...and 12 more figures
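
The fusion step in the Figure 2 caption, where sparse LiDAR depth and MVS depth are combined into an aggregated metric depth prompt, can be sketched as follows. The caption does not state the fusion rule, so this sketch assumes a simple one: prefer the LiDAR value where one exists, fall back to MVS otherwise, and discard values outside a plausible range. The names `fuse_depth_prompt` and `coverage` and the `max_depth` threshold are hypothetical.

```python
from typing import List, Optional

DepthMap = List[List[Optional[float]]]


def _valid(d: Optional[float], max_depth: float) -> bool:
    """A depth value counts as valid if it is present, positive, and in range."""
    return d is not None and 0 < d <= max_depth


def fuse_depth_prompt(lidar: DepthMap, mvs: DepthMap, max_depth: float = 80.0) -> DepthMap:
    """Fuse per-pixel LiDAR and MVS depth into one sparse prompt.

    Assumed rule (not from the paper): LiDAR wins where valid, MVS fills the rest.
    """
    fused: DepthMap = []
    for lidar_row, mvs_row in zip(lidar, mvs):
        row: List[Optional[float]] = []
        for l, m in zip(lidar_row, mvs_row):
            if _valid(l, max_depth):
                row.append(l)
            elif _valid(m, max_depth):
                row.append(m)
            else:
                row.append(None)
        fused.append(row)
    return fused


def coverage(depth: DepthMap) -> float:
    """Fraction of pixels that carry a depth value."""
    total = sum(len(row) for row in depth)
    filled = sum(1 for row in depth for d in row if d is not None)
    return filled / total if total else 0.0
```

In this sketch the fused prompt is still sparse; per the caption, densification is left to the generative depth completion network that consumes it.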