IDSplat: Instance-Decomposed 3D Gaussian Splatting for Driving Scenes
Carl Lindström, Mahan Rafidashti, Maryam Fatemi, Lars Hammarstrand, Martin R. Oswald, Lennart Svensson
TL;DR
IDSplat is a self-supervised framework for dynamic driving scene reconstruction with explicit instance decomposition and learnable motion trajectories. It uses a 3D Gaussian Splatting representation in which each dynamic object is an instance undergoing rigid motion. Instances are initialized from zero-shot Grounded-SAM-2 masks and registered across frames via DINOv3 features, and the resulting trajectories are smoothed with a coordinated-turn model. The method then jointly optimizes Gaussian parameters and trajectories under photometric and lidar consistency losses. It achieves competitive novel-view synthesis and lidar rendering on Waymo NOTR and PandaSet while enabling instance-level editing. Because the instance decomposition is zero-shot and the method generalizes across view densities, it is practical for large-scale autonomous driving simulation and data augmentation without manual annotations.
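To make the rigid-motion instance idea concrete, the following minimal sketch places one object's Gaussian centers in the world frame by applying a single per-timestep pose. It is an illustration under stated assumptions, not the authors' implementation: the helper names `yaw_pose` and `instance_to_world` are hypothetical, and the pose is restricted to a yaw rotation plus translation for brevity.

```python
import numpy as np

def yaw_pose(yaw, translation):
    """Build a 4x4 rigid transform from a yaw angle (rad) and a 3D translation.

    Hypothetical helper: a full pose would allow an arbitrary 3D rotation.
    """
    c, s = np.cos(yaw), np.sin(yaw)
    pose = np.eye(4)
    pose[:3, :3] = np.array([[c, -s, 0.0],
                             [s,  c, 0.0],
                             [0.0, 0.0, 1.0]])
    pose[:3, 3] = translation
    return pose

def instance_to_world(means_obj, pose):
    """Map Gaussian centers (N, 3) from the object frame to the world frame."""
    rotation, translation = pose[:3, :3], pose[:3, 3]
    return means_obj @ rotation.T + translation

# Two Gaussian centers defined once in the object's own frame.
means_obj = np.array([[1.0, 0.0, 0.0],
                      [0.0, 2.0, 0.5]])

# The object's pose at one timestep: rotated 90 degrees, moved to (10, 5, 0).
pose_t = yaw_pose(np.pi / 2, [10.0, 5.0, 0.0])
means_world = instance_to_world(means_obj, pose_t)
```

Because every Gaussian of an instance shares the same pose at a given timestep, the object's appearance is stored once in its own frame and only the pose varies over time. Covariances would be rotated analogously as R Σ Rᵀ.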
Abstract
Reconstructing dynamic driving scenes is essential for developing autonomous systems through sensor-realistic simulation. Although recent methods achieve high-fidelity reconstructions, they either rely on costly human annotations for object trajectories or use time-varying representations without explicit object-level decomposition, leading to intertwined static and dynamic elements that hinder scene separation. We present IDSplat, a self-supervised 3D Gaussian Splatting framework that reconstructs dynamic scenes with explicit instance decomposition and learnable motion trajectories, without requiring human annotations. Our key insight is to model dynamic objects as coherent instances undergoing rigid transformations, rather than as unstructured time-varying primitives. To decompose the scene into instances, we employ zero-shot, language-grounded video tracking anchored to 3D using lidar, and estimate consistent poses via feature correspondences. We introduce a coordinated-turn smoothing scheme to obtain temporally and physically consistent motion trajectories, mitigating pose misalignments and tracking failures, followed by joint optimization of object poses and Gaussian parameters. Experiments on the Waymo Open Dataset demonstrate that our method achieves competitive reconstruction quality while maintaining instance-level decomposition, and that it generalizes across diverse sequences and view densities without retraining, making it practical for large-scale autonomous driving applications. Code will be released.
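The coordinated-turn model mentioned above is a standard vehicle motion model in which an object moves at constant speed and constant turn rate. The sketch below shows a simple Euler-discretized version of that model and rolls it forward in time. It does not reproduce the paper's smoothing scheme, whose details are not given in this abstract; the function names `ct_step` and `ct_rollout`, the state layout, and the Euler discretization are all assumptions made for illustration.

```python
import numpy as np

def ct_step(state, dt):
    """One Euler step of a coordinated-turn model.

    Assumed state layout: [x, y, speed, heading, turn_rate].
    Speed and turn rate are held constant between steps.
    """
    x, y, v, phi, omega = state
    return np.array([
        x + dt * v * np.cos(phi),  # advance along the current heading
        y + dt * v * np.sin(phi),
        v,                         # constant speed
        phi + dt * omega,          # heading changes at the turn rate
        omega,                     # constant turn rate
    ])

def ct_rollout(state0, dt, n_steps):
    """Roll the model forward; returns an array of shape (n_steps + 1, 5)."""
    states = [np.asarray(state0, dtype=float)]
    for _ in range(n_steps):
        states.append(ct_step(states[-1], dt))
    return np.stack(states)

# Zero turn rate gives straight-line motion; a nonzero rate gives an arc.
straight = ct_rollout([0.0, 0.0, 2.0, 0.0, 0.0], dt=0.5, n_steps=4)
turning = ct_rollout([0.0, 0.0, 2.0, 0.0, 0.1], dt=0.5, n_steps=4)
```

In a smoothing setting, a model of this kind would typically act as the motion prior that noisy per-frame pose estimates are reconciled against, which is how physically implausible jumps from tracking failures can be suppressed.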
