Segment Any 4D Gaussians

Shengxiang Ji, Guanjun Wu, Jiemin Fang, Jiazhong Cen, Taoran Yi, Wenyu Liu, Qi Tian, Xinggang Wang

TL;DR

This work tackles open-world 4D segmentation of dynamic scenes. SA4D extends 4D Gaussian Splatting with a temporal identity feature field that addresses Gaussian drifting and a 4D segmentation refinement step that removes artifacts. By coupling a lightweight identity encoding network and a Gaussian identity table with a deformable 4D-GS representation, the approach achieves fast, high-quality 4D segmentation and enables interactive editing tasks such as removal, recoloring, and composition. It reports strong improvements over 3D baselines on HyperNeRF and Neu3D, and it discusses limitations and potential future work in multi-view identity handling. Overall, SA4D provides a practical framework for open-world 4D scene understanding and manipulation.

Abstract

Modeling, understanding, and reconstructing the real world are crucial in XR/VR. Recently, 3D Gaussian Splatting (3D-GS) methods have shown remarkable success in modeling and understanding 3D scenes. Similarly, various 4D representations have demonstrated the ability to capture the dynamics of the 4D world. However, there is a dearth of research focusing on segmentation within 4D representations. In this paper, we propose Segment Any 4D Gaussians (SA4D), one of the first frameworks to segment anything in the 4D digital world based on 4D Gaussians. In SA4D, an efficient temporal identity feature field is introduced to handle Gaussian drifting, with the potential to learn precise identity features from noisy and sparse input. Additionally, a 4D segmentation refinement process is proposed to remove artifacts. Our SA4D achieves precise, high-quality segmentation within seconds in 4D Gaussians and shows the ability to remove, recolor, compose, and render high-quality anything masks. More demos are available at: https://jsxzs.github.io/sa4d/.

Paper Structure

This paper contains 26 sections, 16 equations, 13 figures, 5 tables, 2 algorithms.

Figures (13)

  • Figure 1: Illustration of Gaussian drifting between objects in 4D-GS [wu20234daussians] on the HyperNeRF [park2021hypernerf] dataset. The left part shows a random input view, where the star marks the prompt. The segmentation results become inaccurate at other timestamps because some 3D Gaussians of the cookie (object 1) segmented in frame 1 transform into another object (object 2) in frame 2, as shown in the right part.
  • Figure 2: Overview of our training pipeline. Given a timestamp $t$ and canonical 3D Gaussians $\mathcal{G}$, the ID encoding $e$ and the deformed 3D Gaussians $\mathcal{G}'$ are predicted by an optimizable network $\phi_{\theta}$ and a frozen deformation field network $\mathcal{F}$, respectively. The ID encoding $e$ is then splatted to $E$, and $\phi_c$ classifies each pixel's ID $f$. The whole pipeline is supervised with the loss $\mathcal{L}_{loss}$ against $I_{seg}$, which is predicted by a video tracker.
  • Figure 3: Visual comparisons of our method and baselines on the HyperNeRF [park2021hypernerf] and Neu3D [li2022neural] datasets. The upper five scenes are from the HyperNeRF dataset; for each scene, we visualize and compare the segmentation results at three random novel views and timestamps. The lower six scenes are from the Neu3D dataset; for each scene, we visualize and compare the segmentation results at one random novel view and timestamp.
  • Figure 4: More examples of composition and deletion with segmented 4D Gaussians. (a) Copying a cookie in the scene. (b) Deleting the cup in the scene. (c) Compositing the man with a room from the Neu3D [li2021neural] and Mip-NeRF360 [mipnerf] datasets. (d) Compositing the chickchicken with the man from the HyperNeRF [park2021hypernerf] and Neu3D [li2021neural] datasets.
  • Figure 5: Ablation study of our temporal identity field network. (a) The black regions represent the void class. Predictions from the input 2D supervision (e.g., the video tracker DEVA) are sometimes incorrect (e.g., the cup labeled void 0 in the upper image) and noisy (e.g., the handle labeled void in the lower image). (b) Due to Gaussian drifting, some Gaussians outside the cookie in the upper image transform into the cookie, as shown in the lower image.
  • ...and 8 more figures
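
The data flow described in the Figure 2 caption can be sketched in a few lines of NumPy. This is only an illustrative toy under stated assumptions, not the authors' implementation: the layer sizes, the helper names (`init_mlp`, `identity_encoding`, `splat`, `classify`), the input to $\phi_{\theta}$ (position concatenated with $t$), and the use of cross-entropy for $\mathcal{L}_{loss}$ are all hypothetical. The real rasterizer is replaced by a precomputed blending-weight matrix, which in the actual pipeline would be derived from the deformed Gaussians $\mathcal{G}'$.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(sizes):
    """Random weights for a small fully connected network (hypothetical stand-in)."""
    return [(rng.normal(0.0, 0.1, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp(params, x):
    """Forward pass with ReLU on hidden layers only."""
    for i, (w, b) in enumerate(params):
        x = x @ w + b
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)
    return x

def identity_encoding(phi_theta, xyz, t):
    """e = phi_theta(position, t): one identity feature per Gaussian (assumed input)."""
    t_col = np.full((xyz.shape[0], 1), t)
    return mlp(phi_theta, np.concatenate([xyz, t_col], axis=1))

def splat(e, weights):
    """Stand-in for rasterization: weights[p, g] is an assumed blending
    weight of Gaussian g at pixel p, so E[p] is a weighted sum of features."""
    return weights @ e

def classify(phi_c, E):
    """phi_c followed by a numerically stable per-pixel softmax over identity classes."""
    logits = mlp(phi_c, E)
    logits = logits - logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

def cross_entropy(probs, labels):
    """Assumed form of the loss against tracker labels I_seg."""
    return float(-np.log(probs[np.arange(len(labels)), labels] + 1e-12).mean())

# Toy sizes: 50 Gaussians, 16 pixels, 8-dim identity features, 4 identities.
n_gauss, n_pix, feat_dim, n_ids = 50, 16, 8, 4
xyz = rng.normal(size=(n_gauss, 3))              # canonical Gaussian centers
weights = rng.random((n_pix, n_gauss))
weights /= weights.sum(axis=1, keepdims=True)    # normalized blending weights
I_seg = rng.integers(0, n_ids, size=n_pix)       # placeholder tracker labels

phi_theta = init_mlp([4, 32, feat_dim])          # optimizable in training
phi_c = init_mlp([feat_dim, n_ids])              # pixel classifier

e = identity_encoding(phi_theta, xyz, t=0.5)     # per-Gaussian ID encoding
E = splat(e, weights)                            # per-pixel identity features
probs = classify(phi_c, E)                       # per-pixel ID distribution
loss = cross_entropy(probs, I_seg)
print(e.shape, E.shape, probs.shape, loss)
```

In a real training loop, the loss would be backpropagated into $\phi_{\theta}$ and $\phi_c$ while the deformation field $\mathcal{F}$ stays frozen, as the caption states. The sketch only runs the forward pass.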