
DrivingForward: Feed-forward 3D Gaussian Splatting for Driving Scene Reconstruction from Flexible Surround-view Input

Qijian Tian, Xin Tan, Yuan Xie, Lizhuang Ma

TL;DR

DrivingForward tackles the challenge of real-time driving-scene reconstruction from sparse surround-view inputs by introducing a feed-forward Gaussian Splatting framework. It jointly learns a pose network, a scale-aware depth network, and a Gaussian-parameter network to predict and aggregate per-image primitives, enabling flexible multi-frame inputs and self-supervised depth scaling. The method achieves real-time inference and outperforms both feed-forward and scene-optimized baselines on nuScenes, demonstrating robustness to low overlap and variable input counts. The key contributions include scale-aware localization, per-image Gaussian parameter prediction, and end-to-end training that yields accurate, scalable driving-scene reconstructions without depth or extrinsic supervision during training.

Abstract

We propose DrivingForward, a feed-forward Gaussian Splatting model that reconstructs driving scenes from flexible surround-view input. Driving scene images from vehicle-mounted cameras are typically sparse, with limited overlap, and the movement of the vehicle further complicates the acquisition of camera extrinsics. To tackle these challenges and achieve real-time reconstruction, we jointly train a pose network, a depth network, and a Gaussian network to predict the Gaussian primitives that represent the driving scenes. The pose network and depth network determine the position of the Gaussian primitives in a self-supervised manner, without using depth ground truth or camera extrinsics during training. The Gaussian network independently predicts the remaining primitive parameters from each input image, including covariance, opacity, and spherical harmonics coefficients. At the inference stage, our model can achieve feed-forward reconstruction from flexible multi-frame surround-view input. Experiments on the nuScenes dataset show that our model outperforms existing state-of-the-art feed-forward and scene-optimized reconstruction methods in terms of reconstruction quality.
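To make the role of the depth network concrete, the following is a minimal sketch (not the authors' implementation) of how a per-pixel depth map can be lifted to 3D points that could serve as Gaussian centers. It assumes a pinhole camera model; the function name `unproject_depth`, the intrinsics `fx, fy, cx, cy`, and the `cam_to_world` matrix are hypothetical names introduced only for illustration.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def unproject_depth(
    depth: Sequence[Sequence[float]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
    cam_to_world: Sequence[Sequence[float]],
) -> List[Point]:
    """Lift every pixel (u, v) with depth d to a 3D point in a shared frame.

    Assumes a pinhole camera; cam_to_world is a hypothetical 4x4 rigid transform.
    """
    centers: List[Point] = []
    for v, row in enumerate(depth):
        for u, d in enumerate(row):
            # Back-project the pixel into the camera frame.
            p = ((u - cx) / fx * d, (v - cy) / fy * d, d, 1.0)
            # Move the point into the shared frame (first three rows only).
            centers.append(
                tuple(
                    sum(cam_to_world[r][c] * p[c] for c in range(4))
                    for r in range(3)
                )
            )
    return centers


# Toy 2x2 depth map and a transform that shifts points by +1 along x.
toy_depth = [[2.0, 2.0], [4.0, 4.0]]
shift_x = [
    [1.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
centers = unproject_depth(toy_depth, 1.0, 1.0, 0.0, 0.0, shift_x)
print(len(centers))  # one candidate Gaussian center per pixel
```

In this sketch each pixel yields exactly one point, which mirrors the per-image prediction described above: the remaining parameters (covariance, opacity, spherical harmonics coefficients) would be predicted per pixel by a separate network and attached to these centers.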

Paper Structure (28 sections, 13 equations, 6 figures, 6 tables)

Figures (6)

  • Figure 1: Comparison of our DrivingForward with the latest related works. We achieve real-time reconstruction from small overlap inputs with fewer computing resources.
  • Figure 1: Visual comparison of Gaussian primitives. We compare the reconstructed geometry quality by visualizing zoomed-out views of the 3D Gaussian primitives predicted by pixelSplat, MVSplat, and our DrivingForward. Unlike pixelSplat and MVSplat, which exhibit obvious floating artifacts at the boundaries and are blurred inside the scene, our DrivingForward maintains clear edges and high-quality Gaussian primitives inside the scene, demonstrating its effectiveness in driving scenes.
  • Figure 2: Overview of DrivingForward. Given sparse surround-view input from vehicle-mounted cameras, our model learns scale-aware localization for Gaussian primitives from the small overlap of spatial and temporal context views. A Gaussian network predicts other parameters from each image individually. This feed-forward pipeline enables the real-time reconstruction of driving scenes and the independent prediction from single-frame images supports flexible input modes. At the inference stage, we include only the depth network and the Gaussian network, as shown in the lower part of the figure.
  • Figure 2: Complete visualization results in the main paper. Compared with the state-of-the-art feed-forward and scene-optimized reconstruction methods, our method reduces artifacts and produces more detailed surround-view scenes.
  • Figure 3: Qualitative results of novel surrounding views. Details from the surrounding views are presented for easy comparison.
  • ...and 1 more figure