Inst4DGS: Instance-Decomposed 4D Gaussian Splatting with Multi-Video Label Permutation Learning

Yonghan Lee, Dinesh Manocha

Abstract

We present Inst4DGS, an instance-decomposed 4D Gaussian Splatting (4DGS) approach with long-horizon per-Gaussian trajectories. While dynamic 4DGS has advanced rapidly, instance-decomposed 4DGS remains underexplored, largely due to the difficulty of associating inconsistent instance labels across independently segmented multi-view videos. We address this challenge by introducing per-video label-permutation latents that learn cross-video instance matches through a differentiable Sinkhorn layer, enabling direct multi-view supervision with consistent identity preservation. This explicit label alignment yields sharp decision boundaries and temporally stable identities without identity drift. To further improve efficiency, we propose instance-decomposed motion scaffolds that provide low-dimensional motion bases per object for long-horizon trajectory optimization. Experiments on Panoptic Studio and Neural3DV show that Inst4DGS jointly supports tracking and instance decomposition while achieving state-of-the-art rendering and segmentation quality. On the Panoptic Studio dataset, Inst4DGS improves PSNR from 26.10 to 28.36, and instance mIoU from 0.6310 to 0.9129, over the strongest baseline.
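The differentiable Sinkhorn layer mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the square latent matrix, the temperature, and the iteration count are all assumptions made for illustration. The sketch shows only the generic Sinkhorn idea: alternately normalizing the rows and columns of a score matrix (in log space for numerical stability) so that it approaches a doubly stochastic "soft permutation".

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def sinkhorn(logits, n_iters=50, temperature=1.0):
    """Turn a square matrix of latent scores into a near doubly stochastic matrix.

    Hypothetical helper: `logits[i][j]` scores the match between local label i
    of one video and canonical instance j. Lower temperatures push the result
    closer to a hard permutation.
    """
    n = len(logits)
    log_p = [[x / temperature for x in row] for row in logits]
    for _ in range(n_iters):
        # Row normalization: every row sums to 1 in probability space.
        row_lse = [logsumexp(row) for row in log_p]
        log_p = [[log_p[i][j] - row_lse[i] for j in range(n)] for i in range(n)]
        # Column normalization: every column sums to 1 in probability space.
        col_lse = [logsumexp([log_p[i][j] for i in range(n)]) for j in range(n)]
        log_p = [[log_p[i][j] - col_lse[j] for j in range(n)] for i in range(n)]
    return [[math.exp(x) for x in row] for row in log_p]


# Toy latent for a video with three instances (values are made up).
latent = [[2.0, 0.1, 0.0],
          [0.0, 0.2, 3.0],
          [0.1, 2.5, 0.0]]
soft_perm = sinkhorn(latent, n_iters=50, temperature=0.1)
# Hard assignment read off by taking the argmax of each row.
hard_match = [max(range(3), key=lambda j: row[j]) for row in soft_perm]
```

Because every step is a differentiable normalization, gradients from a downstream segmentation loss could flow back into the latent scores, which is what would make learning the cross-video matches possible in a setup like this.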

Paper Structure (37 sections, 11 equations, 6 figures, 4 tables)

Figures (6)

  • Figure 1: We introduce a novel instance-decomposed 4D Gaussian Splatting (4DGS) approach for long-term, identity-consistent tracking. Given multi-view videos, our method reconstructs a 4D scene representation (left) and decomposes it into per-instance 4DGS tracks (right). By integrating instance segmentation with 4DGS optimization through per-instance motion bases, we enable efficient motion modeling and stable long-horizon tracking for each scene object.
  • Figure 2: Temporally consistent identity reconstruction. Our method maintains stable instance identities over time while achieving high-fidelity tracking and rendering. In contrast, TRASE (Li et al., 2026) exhibits limited motion tracking and identity drift (red box).
  • Figure 3: Inst4DGS Pipeline: Given multi-view videos and per-video segmentation labels, our method reconstructs an instance-decomposed 4DGS scene with long-term per-instance trajectories. The pipeline has two stages: (1) instance-decomposed 3DGS initialization and (2) sequential 4D optimization. A per-video learnable latent differentiably aligns inconsistent instance labels across views, and per-instance motion bases provide a low-dimensional motion scaffold for efficient optimization.
  • Figure 4: Example of local and canonical instance maps produced by our method. Because each video stream is segmented independently, SAM3 annotations contain conflicting labels across views (blue circle). Our method resolves this by learning cross-view label permutations to produce consistent canonical labels (white circles). Our differentiable permutation module maps canonical labels (fourth row) to view-specific local labels (third row), enabling direct supervision despite cross-view inconsistencies.
  • Figure 5: Qualitative photometric rendering comparison. SA4D and TRASE fail under diverse motions in Panoptic Studio (basketball, football, softball), while our method preserves high-fidelity renderings. We also outperform STSG and Dynamic3DGS, which lack instance-decomposed rendering capability.
  • ...and 1 more figure
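
The canonical-to-local label mapping described in the Figure 4 caption can be sketched as follows. This is an illustrative assumption, not the paper's formulation: the function names, the per-pixel probability vector, and the cross-entropy loss are hypothetical, and the permutation matrix is written out by hand rather than learned. The point is only that pushing canonical instance probabilities through a video's permutation matrix expresses them in that video's local label space, where they can be compared directly with its independently produced segmentation.

```python
import math


def canonical_to_local(canonical_probs, perm):
    """Map canonical instance probabilities into one video's local label space.

    Hypothetical helper: `perm[c][l]` is the weight with which canonical
    instance c corresponds to local label l in this video. It may be a hard
    0/1 permutation or a soft, doubly stochastic matrix.
    """
    n_canonical = len(perm)
    n_local = len(perm[0])
    return [
        sum(canonical_probs[c] * perm[c][l] for c in range(n_canonical))
        for l in range(n_local)
    ]


def cross_entropy(local_probs, target_label, eps=1e-8):
    """Negative log-likelihood of the observed local label (assumed loss)."""
    return -math.log(local_probs[target_label] + eps)


# A rendered pixel that is most likely canonical instance 0 (made-up values).
canonical_probs = [0.9, 0.1, 0.0]

# In this video, canonical instances 0 and 1 carry swapped local labels.
perm = [[0.0, 1.0, 0.0],
        [1.0, 0.0, 0.0],
        [0.0, 0.0, 1.0]]

local_probs = canonical_to_local(canonical_probs, perm)
loss_matching = cross_entropy(local_probs, target_label=1)
loss_conflicting = cross_entropy(local_probs, target_label=0)
```

With the swap encoded in `perm`, the pixel agrees with local label 1, so supervising against that label gives the lower loss even though the raw label indices of the canonical map and the video disagree.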