IDSplat: Instance-Decomposed 3D Gaussian Splatting for Driving Scenes

Carl Lindström, Mahan Rafidashti, Maryam Fatemi, Lars Hammarstrand, Martin R. Oswald, Lennart Svensson

TL;DR

IDSplat presents a self-supervised framework for dynamic driving scene reconstruction with explicit instance decomposition and learnable motion trajectories. It leverages a 3D Gaussian Splatting representation where each dynamic object is a rigid-motion instance, initialized with zero-shot masks from Grounded-SAM-2 and registered via DINOv3 features, followed by coordinated-turn trajectory smoothing. The method jointly optimizes Gaussian parameters and trajectories under photometric and lidar consistency losses, achieving competitive novel-view synthesis and lidar rendering on Waymo NOTR and PandaSet, while enabling instance-level editing. Its zero-shot instance decomposition and robust view-density generalization make it practical for large-scale autonomous driving simulation and data augmentation without manual annotations.

Abstract

Reconstructing dynamic driving scenes is essential for developing autonomous systems through sensor-realistic simulation. Although recent methods achieve high-fidelity reconstructions, they either rely on costly human annotations for object trajectories or use time-varying representations without explicit object-level decomposition, leading to intertwined static and dynamic elements that hinder scene separation. We present IDSplat, a self-supervised 3D Gaussian Splatting framework that reconstructs dynamic scenes with explicit instance decomposition and learnable motion trajectories, without requiring human annotations. Our key insight is to model dynamic objects as coherent instances undergoing rigid transformations, rather than unstructured time-varying primitives. For instance decomposition, we employ zero-shot, language-grounded video tracking anchored to 3D using lidar, and estimate consistent poses via feature correspondences. We introduce a coordinated-turn smoothing scheme to obtain temporally and physically consistent motion trajectories, mitigating pose misalignments and tracking failures, followed by joint optimization of object poses and Gaussian parameters. Experiments on the Waymo Open Dataset demonstrate that our method achieves competitive reconstruction quality while maintaining instance-level decomposition and generalizes across diverse sequences and view densities without retraining, making it practical for large-scale autonomous driving applications. Code will be released.
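
The abstract names a coordinated-turn smoothing scheme but does not give its equations. As a rough illustration only, the sketch below implements the textbook coordinated-turn motion model (constant speed and constant turn rate in the ground plane) that smoothers of this kind are typically built on. The state layout `(x, y, speed, heading, turn_rate)` and the function names `ct_step` and `ct_rollout` are assumptions made for this example, not the authors' formulation, and the smoothing itself (filtering against noisy pose estimates) is not shown.

```python
import math

def ct_step(state, dt):
    """Propagate (x, y, speed, heading, turn_rate) by dt seconds.

    Speed and turn rate are held constant over the step, which is the
    defining assumption of the coordinated-turn model.
    """
    x, y, v, heading, omega = state
    if abs(omega) < 1e-9:
        # Near-zero turn rate: the arc degenerates to a straight line.
        x += v * math.cos(heading) * dt
        y += v * math.sin(heading) * dt
    else:
        # Closed-form integral of the velocity along a circular arc.
        new_heading = heading + omega * dt
        x += (v / omega) * (math.sin(new_heading) - math.sin(heading))
        y += (v / omega) * (math.cos(heading) - math.cos(new_heading))
    return (x, y, v, heading + omega * dt, omega)

def ct_rollout(state, dt, steps):
    """Return the list of states obtained by applying ct_step repeatedly."""
    trajectory = [state]
    for _ in range(steps):
        trajectory.append(ct_step(trajectory[-1], dt))
    return trajectory
```

A trajectory predicted this way can serve as the motion prior that noisy per-frame pose estimates are smoothed against; how IDSplat weights the prior against its measurements is not specified in this section.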

Paper Structure

This paper contains 24 sections, 9 equations, 8 figures, 14 tables.

Figures (8)

  • Figure 1: IDSplat performs self-supervised reconstruction of dynamic scenes with explicit instance decomposition and learnable motion trajectories. IDSplat enables high-fidelity rendering of images, instances, and lidar point clouds without the need for human annotations.
  • Figure 2: Overview of our method. 2D masks from Grounded-SAM-2 are lifted to 3D using corresponding lidar point clouds to initialize instances. Object poses are estimated via RANSAC using DINOv3 feature correspondences and further refined through iterative CT smoothing. Trajectories and Gaussian parameters are then optimized to render images, lidar, and instances with motion trajectories.
  • Figure 3: Dynamic mask rendering results. Beyond separating dynamic and static components, our method also renders instance masks for each dynamic object.
  • Figure 4: Qualitative comparisons of novel view synthesis over different view densities ($25\%$, $50\%$, and $75\%$ of training frames) on the dynamic subset of Waymo NOTR. Our instance-decomposed representation enables high-quality rendering of dynamic objects even when trained with sparse viewpoints.
  • Figure 5: Instance editing. Our instance-decomposed representation enables targeted modifications of individual instances, such as their complete removal or the editing of their trajectories.
  • ...and 3 more figures
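
The Figure 2 caption mentions estimating object poses via RANSAC over feature correspondences. The sketch below illustrates that general idea under strong simplifying assumptions: correspondences are given as matched 2D ground-plane points rather than DINOv3 features lifted to 3D, and the pose is a planar rotation plus translation. The helper names (`fit_rigid_2d`, `apply_rigid_2d`, `ransac_rigid_2d`) and the `iters`, `thresh`, and `seed` parameters are hypothetical; this is not the authors' implementation.

```python
import math
import random

def fit_rigid_2d(src, dst):
    """Least-squares rotation angle and translation mapping src onto dst."""
    n = len(src)
    csx = sum(p[0] for p in src) / n
    csy = sum(p[1] for p in src) / n
    cdx = sum(p[0] for p in dst) / n
    cdy = sum(p[1] for p in dst) / n
    s_dot = 0.0
    s_cross = 0.0
    for (sx, sy), (dx, dy) in zip(src, dst):
        ax, ay = sx - csx, sy - csy
        bx, by = dx - cdx, dy - cdy
        s_dot += ax * bx + ay * by
        s_cross += ax * by - ay * bx
    theta = math.atan2(s_cross, s_dot)
    c, s = math.cos(theta), math.sin(theta)
    # Translation aligns the rotated source centroid with the target centroid.
    tx = cdx - (c * csx - s * csy)
    ty = cdy - (s * csx + c * csy)
    return theta, tx, ty

def apply_rigid_2d(model, p):
    """Apply a (theta, tx, ty) transform to a 2D point."""
    theta, tx, ty = model
    c, s = math.cos(theta), math.sin(theta)
    return (c * p[0] - s * p[1] + tx, s * p[0] + c * p[1] + ty)

def ransac_rigid_2d(src, dst, iters=200, thresh=0.1, seed=0):
    """Fit a rigid transform that is robust to outlier correspondences."""
    rng = random.Random(seed)
    indices = range(len(src))
    best_inliers = []
    for _ in range(iters):
        # Two correspondences are the minimal sample for a planar rigid pose.
        i, j = rng.sample(indices, 2)
        model = fit_rigid_2d([src[i], src[j]], [dst[i], dst[j]])
        inliers = [k for k in indices
                   if math.dist(apply_rigid_2d(model, src[k]), dst[k]) < thresh]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    # Refit on all inliers of the best hypothesis.
    refined = fit_rigid_2d([src[k] for k in best_inliers],
                           [dst[k] for k in best_inliers])
    return refined, best_inliers
```

In a full 3D pipeline the minimal sample would be three correspondences and the fit would use an SVD-based alignment; the planar version is used here only to keep the example dependency-free.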