DrivingRecon: Large 4D Gaussian Reconstruction Model For Autonomous Driving

Hao Lu, Tianshuo Xu, Wenzhao Zheng, Yunpeng Zhang, Wei Zhan, Dalong Du, Masayoshi Tomizuka, Kurt Keutzer, Yingcong Chen

TL;DR

DrivingRecon addresses the challenge of fast, large-scale 4D reconstruction of driving scenes from surround-view videos. It introduces a feed-forward architecture that predicts 4D Gaussians, augmented by the Prune and Dilate Block (PD-Block) to adapt point distributions across views and complex edges, and by dynamic/static rendering with cross-temporal supervision. The method achieves superior reconstruction quality and novel view synthesis compared with state-of-the-art baselines, and demonstrates strong cross-scene generalization, as well as practical benefits for pre-training, vehicle adaptation, and scene editing. This work enables realistic driving scene synthesis and robust cross-domain transfer for downstream perception, planning, and simulation tasks.

Abstract

Photorealistic 4D reconstruction of street scenes is essential for developing real-world simulators in autonomous driving. However, most existing methods perform this task offline and rely on time-consuming iterative processes, limiting their practical applications. To this end, we introduce the Large 4D Gaussian Reconstruction Model (DrivingRecon), a generalizable driving scene reconstruction model that directly predicts 4D Gaussians from surround-view videos. To better integrate the surround-view images, the Prune and Dilate Block (PD-Block) is proposed to eliminate overlapping Gaussian points between adjacent views and remove redundant background points. To enhance cross-temporal information, dynamic and static decoupling is tailored to better learn geometry and motion features. Experimental results demonstrate that DrivingRecon significantly improves scene reconstruction quality and novel view synthesis compared to existing methods. Furthermore, we explore applications of DrivingRecon in model pre-training, vehicle adaptation, and scene editing. Our code is available at https://github.com/EnVision-Research/DriveRecon.

Paper Structure

This paper contains 19 sections, 2 equations, 11 figures, 6 tables, 1 algorithm.

Figures (11)

  • Figure 1: The overview. Leveraging temporal multi-view images, the Large 4D Gaussian Reconstruction Model (DrivingRecon) is capable of predicting 4D driving scenes. DrivingRecon serves as a pre-trained model that effectively captures geometric and motion information, thereby enhancing performance in perception, tracking, and planning tasks. Additionally, DrivingRecon can synthesize novel views based on specific camera parameters, ensuring adaptability to various vehicle models. Furthermore, DrivingRecon facilitates the editing of designated 4D scenes through the removal, insertion, and manipulation of objects.
  • Figure 2: The overview of DrivingRecon. (a) Multi-view images pass in turn through the encoder, 3D-aware positional encoding, temporal cross-attention, decoder, and Gaussian adaptor to directly predict 4D Gaussians. (b) The 3D-aware Positional Encoding (3D-PE) leverages DepthNet, alongside camera parameters, to compute 3D world coordinates. These coordinates are integrated with the image features to enhance geometry awareness. (c) The visual encoder comprises multiple 2D convolutional blocks, while the visual decoder includes both 2D convolutional blocks and PD-Blocks. Details of the PD-Block are provided in the PD-Block section. (d) For dynamic objects, we use only next-time-step images to supervise the current Gaussian parameters. For static scenes, rendering supervision is used across timestamps. In addition, a reconstruction loss is applied.
  • Figure 3: The motivation and details of the Prune and Dilate Block (PD-Block). (a) Different views predict repeated Gaussian points, causing model collapse. (b) Simple backgrounds (blue dots) do not need many Gaussian points to be represented, while complex objects (red dots) need more. (c) The PD-Block fuses the multi-view image features into a range-view form, then prunes and dilates the Gaussian points according to the complexity of the scene.
  • Figure 4: The qualitative comparison of reconstruction performance. Blue boxes mark regions with large empty areas lacking Gaussian points. Red areas mark regions that our approach renders clearly across perspectives.
  • Figure 5: Novel view rendering. Based on the predicted Gaussians, we render different views at different times. The novel views show high quality and high spatio-temporal consistency (zoom in for the best view).
  • ...and 6 more figures
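
The caption of Figure 2(b) describes 3D-PE as combining predicted depth with camera parameters to compute 3D world coordinates. The sketch below illustrates that back-projection step only, assuming a pinhole camera model and a known camera-to-world rotation and translation. The function names, parameters, and pure-Python layout are illustrative assumptions, not taken from the DrivingRecon code.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def unproject_pixel(u: float, v: float, depth: float,
                    fx: float, fy: float, cx: float, cy: float) -> Vec3:
    """Back-project pixel (u, v) with depth along the optical axis
    into camera coordinates, assuming a pinhole model (assumption)."""
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return (x, y, depth)


def camera_to_world(p_cam: Vec3,
                    rotation: Sequence[Sequence[float]],
                    translation: Sequence[float]) -> Vec3:
    """Apply the rigid transform p_world = R @ p_cam + t."""
    return tuple(
        sum(rotation[i][j] * p_cam[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


def depth_map_to_world(depth_map: Sequence[Sequence[float]],
                       fx: float, fy: float, cx: float, cy: float,
                       rotation: Sequence[Sequence[float]],
                       translation: Sequence[float]) -> List[List[Vec3]]:
    """Return one world-space point per pixel of a row-major depth map."""
    return [
        [camera_to_world(unproject_pixel(u, v, d, fx, fy, cx, cy),
                         rotation, translation)
         for u, d in enumerate(row)]
        for v, row in enumerate(depth_map)
    ]


if __name__ == "__main__":
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    # Hypothetical 1x2 depth map; camera placed 1 unit along world x.
    points = depth_map_to_world([[5.0, 2.0]], fx=2.0, fy=2.0, cx=0.0, cy=0.0,
                                rotation=identity, translation=[1.0, 0.0, 0.0])
    print(points)
```

In a model such as the one described, per-pixel coordinates like these would then be encoded and combined with the image features; how that fusion is done is not specified in the caption and is left out of this sketch.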