
Director: Instance-aware Gaussian Splatting for Dynamic Scene Modeling and Understanding

Yuheng Jiang, Yiwen Cai, Zihao Wang, Yize Wu, Sicheng Li, Zhuo Su, Shaohui Jiao, Lan Xu

Abstract

Volumetric video seeks to model dynamic scenes as temporally coherent 4D representations. While recent Gaussian-based approaches achieve impressive rendering fidelity, they primarily emphasize appearance and are largely agnostic to instance-level structure, limiting stable tracking and semantic reasoning in highly dynamic scenarios. In this paper, we present Director, a unified spatio-temporal Gaussian representation that jointly models human performance, high-fidelity rendering, and instance-level semantics. Our key insight is that embedding instance-consistent semantics naturally complements 4D modeling, enabling more accurate scene decomposition while supporting robust dynamic scene understanding. To this end, we leverage temporally aligned instance masks and sentence embeddings derived from Multimodal Large Language Models to supervise the learnable semantic features of each Gaussian via two MLP decoders, enabling language-aligned 4D representations and enforcing identity consistency over time. To enhance temporal stability, we bridge 2D optical flow with 4D Gaussians and finetune their motions, yielding reliable initialization and reducing drift. During training, we further introduce geometry-aware SDF constraints, along with regularization terms that enforce surface continuity, enhancing temporal coherence in dynamic foreground modeling. Experiments demonstrate that Director achieves temporally coherent 4D reconstructions while simultaneously enabling instance segmentation and open-vocabulary querying.
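The abstract's semantic supervision can be sketched as follows: each Gaussian carries a learnable feature vector, and two small MLP decoders map it to (i) a language-aligned embedding supervised against MLLM sentence embeddings and (ii) per-instance logits supervised against mask-derived instance ids. All sizes, decoder widths, and exact loss forms below are illustrative assumptions, not the paper's precise formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: N Gaussians, feature dim F, language-embedding dim D,
# K instances in the scene (all chosen for illustration only).
N, F, D, K = 1000, 32, 512, 4

# Learnable per-Gaussian semantic features.
feat = rng.standard_normal((N, F)).astype(np.float32)

def mlp(x, w1, b1, w2, b2):
    """Tiny two-layer ReLU MLP decoder (stand-in for the paper's decoders)."""
    h = np.maximum(x @ w1 + b1, 0.0)
    return h @ w2 + b2

# Two decoders: one into the language-embedding space, one into instance logits.
w1l, b1l = rng.standard_normal((F, 64)) * 0.1, np.zeros(64)
w2l, b2l = rng.standard_normal((64, D)) * 0.1, np.zeros(D)
w1i, b1i = rng.standard_normal((F, 64)) * 0.1, np.zeros(64)
w2i, b2i = rng.standard_normal((64, K)) * 0.1, np.zeros(K)

lang_pred = mlp(feat, w1l, b1l, w2l, b2l)    # (N, D)
inst_logits = mlp(feat, w1i, b1i, w2i, b2i)  # (N, K)

# Supervision targets: one sentence embedding per instance (from an MLLM)
# and an instance id per Gaussian (from temporally aligned masks).
sent_emb = rng.standard_normal((K, D)).astype(np.float32)
inst_id = rng.integers(0, K, size=N)

# Language loss: 1 - cosine similarity with the matched sentence embedding.
target = sent_emb[inst_id]
cos = (lang_pred * target).sum(-1) / (
    np.linalg.norm(lang_pred, axis=-1) * np.linalg.norm(target, axis=-1) + 1e-8)
lang_loss = (1.0 - cos).mean()

# Instance loss: cross-entropy against the mask-derived instance ids.
shifted = inst_logits - inst_logits.max(-1, keepdims=True)
log_prob = shifted - np.log(np.exp(shifted).sum(-1, keepdims=True))
inst_loss = -log_prob[np.arange(N), inst_id].mean()
```

In an actual pipeline both losses would be backpropagated (e.g. in PyTorch) jointly with the photometric rendering loss, so the semantic features are optimized alongside the Gaussians' appearance parameters.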

Paper Structure

This paper contains 11 sections, 12 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: We introduce Director, an instance-aware spatio-temporal Gaussian representation that enables robust human performance tracking, high-fidelity rendering, and instance-level understanding for open-vocabulary queries.
  • Figure 2: Overview of Director. Using temporally consistent SAM3 masks and sentence embeddings, our method decomposes the scene into static background and dynamic foreground, learning language- and instance-aligned features for robust tracking, high-quality rendering, and accurate instance segmentation.
  • Figure 3: Gallery of our results. All images are rendered from novel views. The first four rows show high-fidelity rendering results under challenging scenarios, including fast motions and severe occlusions. The fifth row presents the corresponding instance-level 4D segmentation of the fourth row.
  • Figure 4: Qualitative comparison with 4DGS Wu_2024_CVPR, Spacetime Gaussian li2023spacetime, and TaoGS taogs. Ours shows the best quality. Note that the zoomed-in regions are extremely small in the original image, where even the ground truth is not very sharp.
  • Figure 5: Qualitative comparison with 4D segmentation methods, including SA4D sa4d, SADG li2024sadg, and 4-LEGS fiebelman20254legs. Ours achieves more accurate instance segmentation.
  • ...and 2 more figures