DGGT: Feedforward 4D Reconstruction of Dynamic Driving Scenes using Unposed Images

Xiaoxue Chen, Ziyi Xiong, Yuantao Chen, Gen Li, Nan Wang, Hongcheng Luo, Long Chen, Haiyang Sun, Bing Wang, Guang Chen, Hangjun Ye, Hongyang Li, Ya-Qin Zhang, Hao Zhao

TL;DR

This work tackles the need for fast, scalable 4D reconstruction of dynamic driving scenes from unposed images. It introduces Driving Gaussian Grounded Transformer (DGGT), a pose-free feedforward framework that predicts per-frame camera parameters, pixel-aligned 3D Gaussians, dynamic components, and a lifespan to model temporal visibility, augmented by a 3D motion head and a diffusion-based refinement for high-fidelity rendering. Key contributions include eliminating pose inputs, handling arbitrary sequence lengths, enabling scene editing at the Gaussian level, and achieving state-of-the-art performance with fast inference on Waymo, while also transferring well to nuScenes and Argoverse2 in zero-shot settings. The approach promises practical impact for large-scale training and evaluation in autonomous driving, offering a robust, editable 4D representation suitable for downstream tasks and simulation.

Abstract

Autonomous driving needs fast, scalable 4D reconstruction and re-simulation for training and evaluation, yet most methods for dynamic driving scenes still rely on per-scene optimization, known camera calibration, or short frame windows, making them slow and impractical. We revisit this problem from a feedforward perspective and introduce Driving Gaussian Grounded Transformer (DGGT), a unified framework for pose-free dynamic scene reconstruction. We note that existing formulations, which treat camera pose as a required input, limit flexibility and scalability. Instead, we reformulate pose as an output of the model, enabling reconstruction directly from sparse, unposed images and supporting an arbitrary number of views for long sequences. Our approach jointly predicts per-frame 3D Gaussian maps and camera parameters, disentangles dynamics with a lightweight dynamic head, and preserves temporal consistency with a lifespan head that modulates visibility over time. A diffusion-based rendering refinement further reduces motion and interpolation artifacts and improves novel-view quality under sparse inputs. The result is a single-pass, pose-free algorithm that achieves state-of-the-art performance and speed. Trained and evaluated on large-scale driving benchmarks (Waymo, nuScenes, Argoverse2), our method outperforms prior work both when trained on each dataset and in zero-shot transfer across datasets, and it scales well as the number of input frames increases.
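
To make the idea of a lifespan that "modulates visibility over time" concrete, the following is a minimal sketch, not the authors' formulation. It assumes each Gaussian carries a base opacity, the timestamp of the frame it was predicted from, and a scalar lifespan, and that visibility falls off with a Gaussian-shaped temporal window. The function name `lifespan_opacity`, its parameters, and the window shape are all hypothetical; the actual parameterization is not given in this summary.

```python
import numpy as np


def lifespan_opacity(base_opacity, source_time, lifespan, query_time):
    """Scale per-Gaussian opacity by a temporal window around its source frame.

    Hypothetical form (an assumption, not taken from the paper):
        opacity(t) = base_opacity * exp(-0.5 * ((t - source_time) / lifespan) ** 2)
    A large lifespan keeps a Gaussian visible across many frames (static
    content); a small lifespan confines it to frames near its source time.
    """
    base_opacity = np.asarray(base_opacity, dtype=float)
    source_time = np.asarray(source_time, dtype=float)
    lifespan = np.asarray(lifespan, dtype=float)

    # Time offset between the rendered frame and each Gaussian's source frame.
    dt = query_time - source_time
    window = np.exp(-0.5 * (dt / lifespan) ** 2)
    return base_opacity * window
```

Under this assumed form, a Gaussian rendered at its own source time keeps its full opacity, while one with a short lifespan fades quickly as the query time moves away, which is one way transient content could be suppressed in frames where it should no longer appear.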

Paper Structure

This paper contains 32 sections, 12 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Left: Our feedforward framework reconstructs dynamic driving scenes directly from unposed images within 0.4 seconds, producing outputs such as camera pose, 3D Gaussian tracking, depth, and dynamic maps, which further enable instance-level scene editing. Right: Quantitative comparison shows that our method achieves state-of-the-art reconstruction quality with competitive inference speed, outperforming prior feedforward approaches in both accuracy and efficiency (using single-view input as an example).
  • Figure 2: Overall Architecture. Given unposed images of a dynamic scene, we estimate camera parameters, dynamic maps, and per-pixel Gaussians in a single pass. Subsequently, a motion head is employed to track dynamic objects across time, and their trajectories are interpolated to construct temporally consistent Gaussian representations. Finally, a diffusion-based rendering module refines the resulting composition, producing high-fidelity renderings.
  • Figure 3: Qualitative comparison of different methods on the Waymo dataset (results shown are for the forward-facing camera).
  • Figure 4: 3D Tracking Visualization. Points with the same color correspond across frames.
  • Figure 5: Scene editing results. Cars can be removed or shifted (row 1), and novel vehicles/cyclists inserted from other scenes (row 2). Diffusion refinement fixes artifacts such as holes (red box).
  • ...and 3 more figures
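
The Figure 2 caption describes tracking dynamic objects with a motion head and interpolating their trajectories between frames. A rough sketch of that interpolation step follows, assuming the motion head yields a per-Gaussian 3D displacement to the next input frame and the dynamic map yields a boolean mask. The function `interpolate_centers`, its arguments, and the choice of linear interpolation are illustrative assumptions, not the paper's implementation.

```python
import numpy as np


def interpolate_centers(centers, motion, dynamic_mask, alpha):
    """Move dynamic Gaussian centers part of the way toward the next frame.

    centers:      (N, 3) Gaussian centers at the current input frame.
    motion:       (N, 3) assumed per-Gaussian displacement to the next frame.
    dynamic_mask: (N,) True where a Gaussian belongs to a moving object.
    alpha:        fraction in [0, 1] of the interval between the two frames.
    """
    centers = np.asarray(centers, dtype=float)
    motion = np.asarray(motion, dtype=float)
    mask = np.asarray(dynamic_mask, dtype=bool)
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")

    # Static Gaussians stay in place; dynamic ones follow a linear path.
    out = centers.copy()
    out[mask] += alpha * motion[mask]
    return out
```

Linear interpolation is the simplest choice for an intermediate timestamp; whether the method uses it, or a different trajectory model, is not stated in the caption.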