VDG: Vision-Only Dynamic Gaussian for Driving Simulation

Hao Li, Jingfeng Li, Dingwen Zhang, Chenming Wu, Jieqi Shi, Chen Zhao, Haocheng Feng, Errui Ding, Jingdong Wang, Junwei Han

TL;DR

This work tackles scalable driving simulation without pose priors by introducing Vision-only Dynamic Gaussian (VDG), which leverages self-supervised visual odometry for pose and dense monocular depth alongside a 3D Gaussian Splatting representation. It couples RGB/depth supervision with motion-mask guidance to decompose scenes into static and dynamic components and refines camera poses during training. The method demonstrates strong performance on Waymo and KITTI in both dynamic view synthesis and pose estimation, outperforming pose-free baselines and approaching GT-pose methods while enabling RGB-only input. Overall, VDG offers a practical, faster, and scalable pathway for driving simulation that bypasses LiDAR and precomputed poses, broadening accessibility for sim-to-real research and large-scale urban scenarios.

Abstract

Dynamic Gaussian splatting has led to impressive advances in scene reconstruction and novel-view image synthesis. Existing methods, however, rely heavily on poses and Gaussian initializations pre-computed by Structure-from-Motion (SfM) algorithms or obtained from expensive sensors. For the first time, this paper addresses this issue by integrating self-supervised visual odometry (VO) into our pose-free dynamic Gaussian method (VDG) to bootstrap pose and depth initialization and static-dynamic decomposition. Moreover, VDG works with RGB image input alone and reconstructs dynamic scenes faster and at larger scale than existing pose-free dynamic view-synthesis methods. We demonstrate the robustness of our approach via extensive quantitative and qualitative experiments. Our results show favorable performance over state-of-the-art dynamic view synthesis methods. Additional video and source code will be posted on our project page at https://3d-aigc.github.io/VDG.
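The initialization step described above, unprojecting dense monocular depth with VO-estimated poses to seed the Gaussians, is the crux of working without SfM or LiDAR. Below is a minimal PyTorch sketch of that lifting step; it is an illustration rather than the authors' implementation, and the intrinsics `K`, the camera-to-world pose convention, and the helper name `unproject_depth` are assumptions.

```python
import torch

def unproject_depth(depth, K, T_c2w):
    """Lift a depth map into world-space 3D points.

    depth : (H, W) per-pixel depth from a monocular depth network.
    K     : (3, 3) camera intrinsics (assumed known).
    T_c2w : (4, 4) camera-to-world pose from the VO network.
    """
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1).float()  # (H, W, 3)
    # Back-project pixels into camera space: X_cam = depth * K^{-1} [u, v, 1]^T
    cam = (pix @ torch.linalg.inv(K).T) * depth.unsqueeze(-1)
    # Move camera-space points to world space with the VO pose.
    cam_h = torch.cat([cam, torch.ones_like(depth).unsqueeze(-1)], dim=-1)
    world = cam_h.reshape(-1, 4) @ T_c2w.T
    return world[:, :3]  # (H*W, 3) candidate Gaussian means
```

Each lifted point would then seed a Gaussian $G^k_t$ with zero initial velocity, matching the description in the Figure 2 caption below.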

Paper Structure

This paper contains 18 sections, 4 equations, 9 figures, and 3 tables.

Figures (9)

  • Figure 1: Our proposed VDG is crafted to effectively and uniformly reconstruct large, dynamic urban scenes, as well as to predict poses, from image input alone. Here, we showcase our reconstruction results and pose evaluation on the KITTI [geiger2012we] and Waymo [waymo_open_dataset] datasets, and further compare with the latest pose-free methods. The reconstructed visualizations reveal that our method can model static and dynamic objects without pose priors. Moreover, our method achieves much more accurate pose prediction than other pose-free methods.
  • Figure 2: The proposed VDG. (a) VDG Initialization: uses the off-the-shelf VO network $\mathcal{P}(\cdot)$, $\mathcal{M}(\cdot)$, and $\mathcal{D}(\cdot)$ to estimate the global poses $T_t$, motion masks $M_t$, and depth maps $D_t$ (see Sec. \ref{sec:vo}). Given poses $T_t$ and corresponding depth maps $D_t$, we project the depth maps into 3D space to initialize the Gaussian points $G^k_t = \{\tilde{\mu}^k_t, \Sigma^k, \widetilde{\alpha}^k_t, S^k\}$. Note that the velocity $v$ of each Gaussian is set to 0 (see Sec. \ref{sec:init}). (b) VDG Training Procedure: given the initialized Gaussians $G^k_t$, we train VDG using RGB and depth supervision (see Sec. \ref{sec:train}). Moreover, we apply motion-mask supervision to decompose static and dynamic scenes (Sec. \ref{sec:motion}); see the loss sketch after this figure list. In the end, we adopt a training strategy to refine the VO-given poses $T_t$ (Sec. \ref{sec:strategy}).
  • Figure 3: Illustration of Frozen Gaussian Parameters.
  • Figure 4: Qualitative comparison on the KITTI dataset regarding pose accuracy and rendering quality. Our method outperforms other baselines, even in cases where pose estimation is relatively poor.
  • Figure 5: Qualitative comparison of our approach against other baselines, including pose-free methods, methods requiring GT poses, and GT images, on the Waymo Open Dataset [waymo_open_dataset]. We show synthesis results for two example views in different scenes.
  • ...and 4 more figures
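To make the training supervision outlined in the Figure 2 caption concrete, here is a hedged sketch of a composite loss combining RGB reconstruction, depth consistency against the VO-predicted depth, and motion-mask guidance for the static/dynamic split. The weights `w_depth` and `w_motion`, the function name `vdg_loss`, and the assumption that the renderer produces a per-pixel dynamic-opacity map are all hypothetical; the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def vdg_loss(pred_rgb, gt_rgb, pred_depth, vo_depth, pred_motion, motion_mask,
             w_depth=0.1, w_motion=0.05):
    """Composite supervision sketch: RGB reconstruction, VO-depth
    consistency, and motion-mask guidance for the static/dynamic split.

    pred_*      : per-pixel renders from the dynamic Gaussian model.
    vo_depth    : dense monocular depth used as a soft prior.
    motion_mask : per-pixel mask from the VO front end (1 = dynamic).
    The weights are placeholders, not values from the paper.
    """
    loss_rgb = F.l1_loss(pred_rgb, gt_rgb)
    loss_depth = F.l1_loss(pred_depth, vo_depth)
    # Encourage the rendered dynamic opacity to agree with the motion
    # mask, pushing moving content into the dynamic Gaussian branch.
    loss_motion = F.binary_cross_entropy(
        pred_motion.clamp(1e-5, 1 - 1e-5), motion_mask)
    return loss_rgb + w_depth * loss_depth + w_motion * loss_motion
```

In a training loop, this scalar would be backpropagated both through the Gaussian parameters and, per the pose-refinement strategy in the Figure 2 caption, through the VO-given camera poses $T_t$.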