ReconDrive: Fast Feed-Forward 4D Gaussian Splatting for Autonomous Driving Scene Reconstruction

Haibao Yu, Kuntao Xiao, Jiahang Wang, Ruiyang Hao, Yuxin Huang, Guoran Hu, Haifang Qin, Bowen Jing, Yuntian Bo, Ping Luo

TL;DR

ReconDrive is a feed-forward framework that leverages and extends the 3D foundation model VGGT for rapid, high-fidelity 4DGS generation. It achieves performance competitive with per-scene optimization while being orders of magnitude faster, providing a scalable and practical solution for realistic driving simulation.

Abstract

High-fidelity visual reconstruction and novel-view synthesis are essential for realistic closed-loop evaluation in autonomous driving. While 4D Gaussian Splatting (4DGS) offers a promising balance of accuracy and efficiency, existing per-scene optimization methods require costly iterative refinement, rendering them unscalable for extensive urban environments. Conversely, current feed-forward approaches often suffer from degraded photometric quality. To address these limitations, we propose ReconDrive, a feed-forward framework that leverages and extends the 3D foundation model VGGT for rapid, high-fidelity 4DGS generation. Our architecture introduces two core adaptations to tailor the foundation model to dynamic driving scenes: (1) Hybrid Gaussian Prediction Heads, which decouple the regression of spatial coordinates and appearance attributes to overcome the photometric deficiencies inherent in generalized foundation features; and (2) a Static-Dynamic 4D Composition strategy that explicitly captures temporal motion via velocity modeling to represent complex dynamic environments. Benchmarked on nuScenes, ReconDrive significantly outperforms existing feed-forward baselines in reconstruction, novel-view synthesis, and 3D perception. It achieves performance competitive with per-scene optimization while being orders of magnitude faster, providing a scalable and practical solution for realistic driving simulation.
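
The static-dynamic composition described in the abstract can be illustrated with a minimal sketch. The abstract states only that temporal motion is captured via velocity modeling; the constant-velocity (linear) displacement, the `Gaussian` class, the `is_dynamic` flag, and the `compose_at_time` function below are all assumptions made for illustration, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Gaussian:
    """Hypothetical container: only the fields needed for this sketch."""
    center: Vec3       # Gaussian center at the reference time t_ref
    velocity: Vec3     # per-Gaussian velocity (assumed constant over time)
    is_dynamic: bool   # True if the Gaussian belongs to a moving object


def compose_at_time(gaussians: List[Gaussian], t: float, t_ref: float = 0.0) -> List[Vec3]:
    """Return all Gaussian centers at time t.

    Static Gaussians keep their center unchanged. Dynamic Gaussians are
    displaced by velocity * (t - t_ref). The linear motion model is an
    assumption of this sketch.
    """
    dt = t - t_ref
    centers: List[Vec3] = []
    for g in gaussians:
        if g.is_dynamic:
            moved = tuple(c + v * dt for c, v in zip(g.center, g.velocity))
            centers.append(moved)
        else:
            centers.append(g.center)
    return centers


# One static Gaussian (background) and one dynamic Gaussian (moving object).
scene = [
    Gaussian(center=(0.0, 0.0, 10.0), velocity=(0.0, 0.0, 0.0), is_dynamic=False),
    Gaussian(center=(2.0, 0.0, 5.0), velocity=(1.0, 0.0, -2.0), is_dynamic=True),
]
centers_half_second = compose_at_time(scene, t=0.5)
```

Under these assumptions, the static Gaussian stays at its reference position while the dynamic one moves along its velocity vector. Rendering at any intermediate timestamp then only requires recomputing the dynamic centers.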

Paper Structure (39 sections, 12 equations, 10 figures, 3 tables)

Figures (10)

  • Figure 1: ReconDrive Inference Framework. ReconDrive is a powerful feed-forward 4D Gaussian splatting generation framework tailored for urban scene reconstruction and novel-view synthesis. In this framework, we select two context frames from each segment of an urban scene as input, and adopt a static-dynamic composition strategy to represent 4D Gaussians. To further enhance geometric precision, we design a dedicated Gaussian prediction head, consisting of a Gaussian Parameter Prediction Head (GPPH) and a Gaussian Center Prediction Head (GCPH), which enables the generation of Gaussians with more accurate geometric attributes.
  • Figure 2: Mapping Among Images, Masks, Dense Features, and Gaussian Indices. (a) Input image. (b) Dynamic object mask. (c) Dense features. (d) Gaussian kernels. Each pixel in the dense feature map generates a corresponding Gaussian kernel, enabling consistent pixel-wise mapping across all these components.
  • Figure 3: Visual Comparisons of Scene Reconstruction and Novel-View Synthesis. Compared with per-scene optimization methods (Street Gaussians, PVG, DeformableGS, OmniRe), ReconDrive maintains high-quality visual rendering in both scene reconstruction (Original View) and novel-view synthesis (1–3 meters of lateral movement leftward and rightward). Compared with the feed-forward method DrivingForward, ReconDrive preserves accurate geometric consistency (evident in the tree and its surroundings), and the improvement becomes more pronounced as the moving distance increases. It also exhibits less image distortion and blurriness, particularly near image boundaries. Extended visual comparisons are provided in the Appendix.
  • Figure 4: Project Loss Computing Pipeline. Note that $K_t$ and $K_s$ are identical since the camera is fixed.
  • Figure 5: Distribution of Spatial Distances between the Point Maps Generated by the Original VGGT and Ground Truth LiDAR Point Clouds, highlighting significant metric misalignment.
  • ...and 5 more figures
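
The pixel-wise mapping described in the Figure 2 caption (image, dynamic object mask, dense features, Gaussian indices) can be sketched as follows. The caption says only that each pixel of the dense feature map generates a corresponding Gaussian. The row-major index convention and the helper names `pixel_to_gaussian_index` and `split_indices_by_mask` are assumptions made for illustration.

```python
from typing import List, Tuple


def pixel_to_gaussian_index(row: int, col: int, width: int) -> int:
    """Map a pixel location to a flat Gaussian index (row-major order assumed)."""
    return row * width + col


def split_indices_by_mask(mask: List[List[int]]) -> Tuple[List[int], List[int]]:
    """Split Gaussian indices into static and dynamic sets.

    `mask` is a binary dynamic-object mask with the same height and width
    as the dense feature map: 1 marks a dynamic pixel, 0 a static pixel.
    """
    width = len(mask[0])
    static_ids: List[int] = []
    dynamic_ids: List[int] = []
    for row, mask_row in enumerate(mask):
        for col, is_dynamic in enumerate(mask_row):
            idx = pixel_to_gaussian_index(row, col, width)
            if is_dynamic:
                dynamic_ids.append(idx)
            else:
                static_ids.append(idx)
    return static_ids, dynamic_ids


# Toy 2x3 mask: the right-hand pixels belong to a moving object.
toy_mask = [
    [0, 0, 1],
    [0, 1, 1],
]
static_ids, dynamic_ids = split_indices_by_mask(toy_mask)
```

Because the index depends only on the pixel location, the same index addresses the image pixel, the mask value, the feature vector, and the Gaussian. This is what would allow a mask to select which Gaussians are treated as dynamic.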