TRASE: Tracking-free 4D Segmentation and Editing

Yun-Jin Li, Mariia Gladkova, Yan Xia, Daniel Cremers

TL;DR

TRASE addresses dynamic scene understanding by learning a tracking-free 4D semantic field. It combines dynamic geometry reconstruction with the learning of $32$-dimensional Gaussian features, guided by 2D SAM masks through a soft-mined contrastive objective, and then applies DBSCAN clustering to obtain temporally and spatially consistent object segments. The approach enables interactive editing tasks such as object removal, scene composition, and style transfer directly in 3D, and achieves state-of-the-art segmentation on five dynamic benchmarks, including from unseen viewpoints. Overall, TRASE offers a principled, efficient framework for dynamic scene segmentation and editing that scales to multi-view data and real-time interaction.

Abstract

Understanding dynamic 3D scenes is crucial for extended reality (XR) and autonomous driving. Incorporating semantic information into 3D reconstruction enables holistic scene representations, unlocking immersive and interactive applications. To this end, we introduce TRASE, a novel tracking-free 4D segmentation method for dynamic scene understanding. TRASE learns a 4D segmentation feature field in a weakly-supervised manner, leveraging a soft-mined contrastive learning objective guided by SAM masks. The resulting feature space is semantically coherent and well-separated, and final object-level segmentation is obtained via unsupervised clustering. This enables fast editing, such as object removal, composition, and style transfer, by directly manipulating the scene's Gaussians. We evaluate TRASE on five dynamic benchmarks, demonstrating state-of-the-art segmentation performance from unseen viewpoints and its effectiveness across various interactive editing tasks. Our project page is available at: https://yunjinli.github.io/project-sadg/
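The clustering and editing steps described above can be illustrated with a toy example. The following is a minimal sketch, not the authors' implementation: it runs a small pure-Python DBSCAN over synthetic 32-dimensional feature vectors (a stand-in for learned Gaussian features) and then "removes an object" by dropping every Gaussian that shares a cluster with a selected one. The helper names `dbscan` and `blob`, the synthetic data, and the `eps` and `min_samples` values are all assumptions chosen for illustration.

```python
import math
import random


def dbscan(points, eps, min_samples):
    """Minimal DBSCAN. Returns one label per point; -1 marks noise."""
    n = len(points)
    # Brute-force eps-neighborhoods (each point counts as its own neighbor).
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n  # None means "not visited yet"
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point joins as border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # j is a core point: keep expanding
    return labels


random.seed(0)
DIM = 32  # feature dimension stated in the TL;DR


def blob(center, count, spread=0.01):
    """Hypothetical stand-in for the features of one object."""
    return [[center + random.gauss(0, spread) for _ in range(DIM)] for _ in range(count)]


# Toy feature matrix: 20 Gaussians per "object", two objects.
features = blob(0.0, 20) + blob(1.0, 20)
labels = dbscan(features, eps=0.5, min_samples=5)

# Object removal sketch: drop all Gaussians in the cluster of a selected Gaussian.
selected = labels[0]
kept = [i for i, label in enumerate(labels) if label != selected]
print(sorted(set(labels)), len(kept))
```

In this toy setting the two blobs are far apart relative to `eps`, so each becomes one cluster; real learned features would need `eps` and `min_samples` tuned to the feature scale.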

Paper Structure

This paper contains 41 sections, 8 equations, 29 figures, 15 tables.

Figures (29)

  • Figure 1: We propose TRASE, a novel tracking-free 4D segmentation approach. TRASE achieves superior object segmentation from click prompts and further supports interactive editing tasks such as object removal and text-prompt-based segmentation.
  • Figure 2: Our pipeline. TRASE consists of two main components: dynamic geometry reconstruction (\ref{subsec:georecon}) and Gaussian feature learning (\ref{subsec:gaussfeat}). We adopt the approach from yang2024deformable to effectively learn dynamic 3D reconstruction. Given a 4D reconstruction, we learn Gaussian features $\boldsymbol{F} \in \mathbb{R}^{N \times 32}$ using a novel contrastive learning objective guided by SAM kirillov2023sam masks. Once trained, we apply clustering ester1996density directly to the learned features, enabling segmentation field rendering. Our representation supports various scene-editing applications, including object segmentation via click/text prompts in our GUI, object removal, and scene composition.
  • Figure 3: Example of a failure case from the video tracker DEVA cheng2023tracking. Different colors refer to various object IDs associated by the model. We can observe that due to the inconsistent presence of the human torso in the video, DEVA fails to provide reliable and consistent object masks for supervision, resulting in a noisy class-agnostic segmentation and poorly segmented objects for dynamic Gaussian Grouping ye2023gaussian (GG). We also provide a video of this example in the supplementary materials.
  • Figure 4: Illustration of the pixel-mask correspondence vector $\boldsymbol{y}_{i}$, which is constructed from a subset of SAM masks $\boldsymbol{\mathcal{M}}_{SAM}$ for a given image $\boldsymbol{I}_{GT}$.
  • Figure 5: Qualitative segmentation results. While Gaussian Grouping ye2023gaussian and SAGA cen2023saga suffer from spurious Gaussians, our method produces crisp, floater-free segmentation. Object boundaries in SA4D ji2024segment and DGD labe2024dgd segmentations are not tight and include part of the background when rendered. OpenGaussian wu2025opengaussian does not capture whole objects in its segmentation. Our model consistently demonstrates superior segmentation quality and crisp object masks.
  • ...and 24 more figures
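
The pixel-mask correspondence vector from Figure 4 can be sketched in a few lines. The caption only states that $\boldsymbol{y}_{i}$ is built from a subset of SAM masks for a given image; the sketch below assumes one plausible reading, namely a binary membership vector with one entry per mask. The function names `correspondence_vectors` and `share_a_mask`, and the idea of treating two pixels as related when they share a mask, are illustrative assumptions and not taken from the paper.

```python
def correspondence_vectors(masks):
    """Assumed reading: y[r][c][k] = 1 if mask k covers pixel (r, c), else 0."""
    height, width = len(masks[0]), len(masks[0][0])
    return [[[m[r][c] for m in masks] for c in range(width)] for r in range(height)]


def share_a_mask(y_a, y_b):
    """True if two pixels fall inside at least one common mask (hypothetical pairing rule)."""
    return sum(a * b for a, b in zip(y_a, y_b)) > 0


# Three toy binary masks over a 2 x 4 image; masks may overlap.
masks = [
    [[1, 1, 0, 0], [1, 1, 0, 0]],  # mask 0: left half
    [[0, 0, 1, 1], [0, 0, 1, 1]],  # mask 1: right half
    [[1, 0, 0, 0], [0, 0, 0, 0]],  # mask 2: a small part inside mask 0
]

y = correspondence_vectors(masks)
print(y[0][0])                            # membership of the top-left pixel
print(share_a_mask(y[0][0], y[0][1]))     # both pixels lie in mask 0
print(share_a_mask(y[0][0], y[0][2]))     # no common mask
```

Under this reading, the dot product of two such vectors counts the masks the pixels have in common, which is one simple signal a mask-guided contrastive objective could use to choose positive and negative pairs.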