Divide-and-Conquer: Dual-Hierarchical Optimization for Semantic 4D Gaussian Splatting

Zhiying Yan, Yiyuan Liang, Shilv Cai, Tao Zhang, Sheng Zhong, Luxin Yan, Xu Zou

TL;DR

The paper tackles the challenge of dynamic scene understanding with semantic 4D Gaussians by introducing Dual-Hierarchical Optimization (DHO), which separates static background from dynamic foreground via Hierarchical Gaussian Flow and provides semantic-guided rendering through Hierarchical Gaussian Guidance. It augments 4D Gaussian Splatting with semantic features bound to Gaussians, compressed CLIP semantics, and a deformation-aware rendering pipeline, enabling higher-fidelity rendering and improved segmentation in complex scenes. Empirical results on synthetic and real datasets show consistent gains in PSNR, SSIM, LPIPS, and mIoU, with ablations confirming the essential roles of HGF and HGG. The approach is memory-efficient and adaptable to existing models, offering robust semantic reasoning and downstream editability for dynamic 4D scenes.

Abstract

Semantic 4D Gaussians can be used for reconstructing and understanding dynamic scenes, which exhibit more temporal variation than static scenes. Directly applying static methods to dynamic scenes fails to capture their temporal features. Few works focus on dynamic scene understanding based on Gaussian Splatting, because when the same update strategy is employed for both dynamic and static parts, regardless of the distinction and interaction between Gaussians, significant artifacts and noise appear. We propose Dual-Hierarchical Optimization (DHO), which consists of Hierarchical Gaussian Flow and Hierarchical Gaussian Guidance in a divide-and-conquer manner. The former implements effective division of static and dynamic rendering and features. The latter helps to mitigate the issue of dynamic foreground rendering distortion in textured complex scenes. Extensive experiments show that our method consistently outperforms the baselines on both synthetic and real-world datasets, and supports various downstream tasks. Project Page: https://sweety-yan.github.io/DHO.

Paper Structure

This paper contains 21 sections, 12 equations, 13 figures, 11 tables.

Figures (13)

  • Figure 1: Visualization of different methods. Top Left: Use static methods directly. Top Right: First train the 4D scene and then freeze the geometric and color parameters, adding the semantic feature property for optimization. Bottom Left: Jointly optimizing the 4D scene and semantic features. Bottom Right: Our Dual-Hierarchical Optimization method. By adopting a divide-and-conquer strategy for Gaussian optimization, we achieve substantial improvements in both rendering quality and segmentation accuracy.
  • Figure 2: Visualization of Gaussian points with large deformation. We select the top-k Gaussian points with the largest deformation for rendering. Vanilla 4DGS mixes highly deformable dynamic foregrounds with static backgrounds. Our method effectively separates the static and dynamic parts.
  • Figure 3: The overall pipeline of our model. We add semantic properties to each Gaussian and obtain the geometric deformation of the Gaussian at each timestamp $t$ through the deformation field. In the coarse stage, Gaussians are subject to geometric constraints, while in the fine stage, the geometric constraints are relaxed and semantic feature constraints are introduced, ensuring foreground-background separation. We utilize dynamic foreground masks obtained from scene priors for hierarchical Gaussian guidance of the scene, enhancing the rendering quality of dynamic foregrounds with complex backgrounds.
  • Figure 4: Visualization of the HyperNeRF dataset. (a) Visualization of the "Broom" scene. Our method outperforms the baseline, whereas DGD nearly fails in complex scenes. (b) Visual segmentation comparisons of our method and DGD. Our method significantly reduces artifacts and noise. (c) Visual scale comparisons of our method and SA4D. Our method incorporates multi-scale information to perceive objects at different scales.
  • Figure 5: Visualization of ablation study. (a) Ablation of HGF results in fragmented semantic features. (b) Ablation of HGG results in degradation in foreground rendering quality.
  • ...and 8 more figures
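
The top-k selection mentioned in the Figure 2 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `top_k_deforming`, the choice of Euclidean displacement of Gaussian centers as the deformation measure, and the toy coordinates are all hypothetical.

```python
import heapq
import math


def top_k_deforming(positions_canonical, positions_at_t, k):
    """Return indices of the k Gaussians whose centers moved the most.

    Assumption (not from the paper): deformation is measured as the
    Euclidean distance between a Gaussian's canonical center and its
    deformed center at timestamp t.
    """
    if len(positions_canonical) != len(positions_at_t):
        raise ValueError("both position lists must have the same length")
    displacement = [
        math.dist(p0, pt) for p0, pt in zip(positions_canonical, positions_at_t)
    ]
    # nlargest returns indices ordered from largest to smallest displacement.
    return heapq.nlargest(k, range(len(displacement)), key=displacement.__getitem__)


# Toy data: four Gaussian centers before and after deformation.
canonical = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (3.0, 3.0, 3.0)]
deformed = [(0.0, 0.0, 0.0), (1.0, 0.5, 0.0), (0.0, 2.0, 2.0), (3.0, 3.0, 3.1)]
selected = top_k_deforming(canonical, deformed, k=2)
print(selected)
```

In a well-separated scene, the selected indices would be expected to fall on the dynamic foreground; if many of them land on static background, that is the mixing the caption attributes to vanilla 4DGS.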