DrivingScene: A Multi-Task Online Feed-Forward 3D Gaussian Splatting Method for Dynamic Driving Scenes

Qirui Hou, Wenzhang Sun, Chang Zeng, Chunfeng Wang, Hao Li, Jianxun Cui

TL;DR

DrivingScene tackles real-time 4D reconstruction of dynamic driving scenes from two surround-view frames by decoupling static geometry and dynamic motion. It uses a static backbone based on 3D Gaussian Splatting to model geometry and appearance, combined with a lightweight residual flow network that accounts for non-rigid motion, yielding a total motion field for dynamic rendering. A two-stage coarse-to-fine training strategy stabilizes optimization by first learning a robust static prior, then refining it with dynamics using self-supervised losses. On nuScenes, the method achieves state-of-the-art results for novel-view synthesis and depth estimation while maintaining real-time efficiency and providing intermediate representations such as depth and scene flow. This approach offers a practical, multi-task perception solution for autonomous driving with explicit static-dynamic decoupling and efficient online inference.

Abstract

Real-time, high-fidelity reconstruction of dynamic driving scenes is challenged by complex dynamics and sparse views, with prior methods struggling to balance quality and efficiency. We propose DrivingScene, an online, feed-forward framework that reconstructs 4D dynamic scenes from only two consecutive surround-view images. Our key innovation is a lightweight residual flow network that predicts the non-rigid motion of dynamic objects per camera on top of a learned static scene prior, explicitly modeling dynamics via scene flow. We also introduce a coarse-to-fine training paradigm that circumvents the instabilities common to end-to-end approaches. Experiments on the nuScenes dataset show that our image-only method simultaneously generates high-quality depth, scene flow, and 3D Gaussian point clouds online, significantly outperforming state-of-the-art methods in both dynamic reconstruction and novel view synthesis.

Paper Structure

This paper contains 11 sections, 4 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Example predictions by our method on nuScenes (Caesar et al., 2020). Top to bottom: input image (one frame of the sequence), depth map, and optical flow. Our model is fully self-supervised and can handle dynamic objects and occlusions explicitly.
  • Figure 2: Overview of DrivingScene. Given two consecutive surround-view frames, our framework first predicts a static scene composed of 3D Gaussian primitives using a depth and a Gaussian parameter network. A residual flow network then computes the non-rigid motion field between the frames. This motion is combined with the rigid flow derived from ego-motion and applied as temporal displacements to the static Gaussians, resulting in a complete, dynamic 4D scene representation.
  • Figure 3: The architecture of the residual flow network.
  • Figure 4: Qualitative results of surrounding views. Details from the surrounding views are shown for easy comparison.
  • Figure 5: Comparison of rigid flow with full flow.
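
The flow composition described in the Figure 2 overview (rigid flow from ego-motion plus a predicted residual flow, applied as displacements to the static Gaussians) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of 3D Gaussian centers in camera coordinates, and the convention that the ego-motion is given as a rotation and translation mapping frame-t coordinates to frame-(t+1) coordinates are all assumptions made for illustration, and the residual flow is supplied by hand rather than predicted by a network.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def rigid_flow(points: Sequence[Vec3],
               rotation: Sequence[Sequence[float]],
               translation: Vec3) -> List[Vec3]:
    """Displacement of each point induced by ego-motion alone.

    Assumed convention: (rotation, translation) maps coordinates in the
    frame-t camera to coordinates in the frame-(t+1) camera.
    """
    flows = []
    for p in points:
        moved = tuple(
            sum(rotation[i][j] * p[j] for j in range(3)) + translation[i]
            for i in range(3)
        )
        flows.append(tuple(moved[i] - p[i] for i in range(3)))
    return flows


def total_flow(rigid: Sequence[Vec3], residual: Sequence[Vec3]) -> List[Vec3]:
    """Full motion field: rigid (ego-motion) part plus non-rigid residual."""
    return [tuple(r[i] + d[i] for i in range(3)) for r, d in zip(rigid, residual)]


def displace(points: Sequence[Vec3], flow: Sequence[Vec3]) -> List[Vec3]:
    """Apply the flow as a temporal displacement to the static centers."""
    return [tuple(p[i] + f[i] for i in range(3)) for p, f in zip(points, flow)]


# Two hypothetical Gaussian centers: a static wall point and a moving car.
centers = [(0.0, 0.0, 10.0), (2.0, 0.0, 5.0)]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
ego_translation = (0.0, 0.0, -1.0)          # ego drives 1 m forward
residual = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0)]  # only the car moves on its own

rigid = rigid_flow(centers, identity, ego_translation)
full = total_flow(rigid, residual)
moved_centers = displace(centers, full)
```

In this toy setup the static point receives only the rigid flow, while the moving point receives the rigid flow plus its residual, which is the distinction the rigid-versus-full-flow comparison in Figure 5 refers to.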