SplatFlow: Self-Supervised Dynamic Gaussian Splatting in Neural Motion Flow Field for Autonomous Driving
Su Sun, Cheng Zhao, Zhuoyang Sun, Yingjie Victor Chen, Mei Chen
TL;DR
SplatFlow tackles dynamic urban scene reconstruction without expensive 3D bounding-box annotations by introducing Self-Supervised Dynamic Gaussian Splatting within Neural Motion Flow Fields (NMFF). It represents the static background with 3D Gaussians and dynamic objects with 4D Gaussians, while NMFF jointly models the temporal correspondences of Gaussians and LiDAR points to enforce cross-view consistency. A LiDAR-based motion prior is learned via NMFF pretraining, and optical-flow distillation from 2D foundation models further improves dynamic-object identification. On the Waymo Open and KITTI datasets, SplatFlow achieves state-of-the-art performance for image reconstruction and novel-view synthesis without tracked 3D bounding boxes while rendering in real time, which makes it a practical candidate for scalable autonomous driving workflows.
Abstract
Most existing Dynamic Gaussian Splatting methods for complex dynamic urban scenarios rely on accurate object-level supervision from expensive manual labeling, which limits their scalability in real-world applications. In this paper, we introduce SplatFlow, a Self-Supervised Dynamic Gaussian Splatting method within Neural Motion Flow Fields (NMFF) that learns 4D space-time representations without requiring tracked 3D bounding boxes, enabling accurate dynamic scene reconstruction and novel-view RGB/depth/flow synthesis. SplatFlow provides a unified framework that integrates a time-dependent 4D Gaussian representation within NMFF, where NMFF is a set of implicit functions that model the temporal motions of both LiDAR points and Gaussians as continuous motion flow fields. Leveraging NMFF, SplatFlow decomposes the scene into static background and dynamic objects, representing them with 3D and 4D Gaussian primitives, respectively. NMFF also models the correspondences of each 4D Gaussian across time, which allows temporal features to be aggregated to enhance the cross-view consistency of dynamic components. SplatFlow further improves dynamic object identification by distilling features from 2D foundation models into the 4D space-time representation. Comprehensive evaluations on the Waymo and KITTI datasets validate SplatFlow's state-of-the-art (SOTA) performance for both image reconstruction and novel view synthesis in dynamic urban scenarios.
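To make the idea of a continuous motion flow field more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a flow field can be queried as a function of a 3D position and a timestamp that returns a 3D displacement, and that points with a small predicted displacement can be treated as static. The class `ToyMotionFlowField`, the helpers `warp` and `split_static_dynamic`, the untrained random weights, and the `threshold` value are all hypothetical and chosen only for illustration; the abstract does not specify how the decomposition is actually performed.

```python
import numpy as np


class ToyMotionFlowField:
    """Tiny MLP mapping (x, y, z, t) to a 3D flow vector.

    Hypothetical stand-in for an implicit motion flow field.
    The weights are random and untrained.
    """

    def __init__(self, hidden=32, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(0.0, 0.5, size=(4, hidden))
        self.b1 = np.zeros(hidden)
        self.w2 = rng.normal(0.0, 0.5, size=(hidden, 3))
        self.b2 = np.zeros(3)

    def flow(self, xyz, t):
        """Return the per-point displacement predicted at time t, shape (N, 3)."""
        t_col = np.full((xyz.shape[0], 1), t, dtype=xyz.dtype)
        inputs = np.concatenate([xyz, t_col], axis=1)  # (N, 4)
        hidden = np.tanh(inputs @ self.w1 + self.b1)   # (N, hidden)
        return hidden @ self.w2 + self.b2              # (N, 3)


def warp(field, xyz, t):
    """Move points (LiDAR returns or Gaussian centers) along the predicted flow."""
    return xyz + field.flow(xyz, t)


def split_static_dynamic(field, xyz, t, threshold):
    """Assumed decomposition rule: flow magnitude above threshold means dynamic."""
    speed = np.linalg.norm(field.flow(xyz, t), axis=1)
    dynamic = speed > threshold
    return xyz[~dynamic], xyz[dynamic]


# Usage on synthetic points (no real LiDAR data involved).
rng = np.random.default_rng(1)
points = rng.uniform(-1.0, 1.0, size=(100, 3))
field = ToyMotionFlowField()
warped = warp(field, points, t=0.0)
static_pts, dynamic_pts = split_static_dynamic(field, points, t=0.0, threshold=0.5)
```

Because the same function can be evaluated at any position and time, one field can in principle serve both LiDAR points and Gaussian centers, which is the property the abstract attributes to NMFF. In a trained system the weights would be optimized rather than sampled at random.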
