TSGaussian: Semantic and Depth-Guided Target-Specific Gaussian Splatting from Sparse Views

Liang Zhao, Zehan Bao, Yi Xie, Hong Chen, Yaohui Chen, Weifu Li

TL;DR

The paper introduces TSGaussian, a target-specific Gaussian Splatting framework guided by semantic segmentation and monocular depth priors to reconstruct designated objects from sparse views. It uses YOLOv9 bounding boxes as prompts for SAM to obtain 2D target masks, and combines these masks with a learnable identity encoding for each Gaussian, differentiable semantic rendering, and a pruning strategy that concentrates resources on targets while suppressing background noise. A multi-scale depth regularization scheme, comprising soft-hard and global-local depth losses, stabilizes geometry under sparse viewpoints and improves depth accuracy, all within an end-to-end training objective that includes color, semantic, and depth terms. Experiments on public datasets and a new Citrus dataset show that TSGaussian outperforms state-of-the-art 3D Gaussian methods on PSNR, SSIM, and LPIPS, with improved target-specific novel view synthesis and reduced artifacts and background leakage.
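As a rough illustration of how color, semantic, and depth terms might be combined into a single training objective, here is a minimal NumPy sketch. The weighted-sum form, the weights `w_sem` and `w_depth`, the choice of L1 and cross-entropy terms, and all function names are assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np

def color_l1(rendered_rgb, target_rgb):
    """Mean absolute error between rendered and ground-truth colors."""
    return float(np.mean(np.abs(rendered_rgb - target_rgb)))

def semantic_cross_entropy(logits, labels):
    """Cross-entropy between per-pixel class logits (N, K) and mask labels (N,)."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(labels.shape[0]), labels].mean())

def depth_l1(rendered_depth, prior_depth):
    """Mean absolute error against a depth prior (assumed already scale-aligned)."""
    return float(np.mean(np.abs(rendered_depth - prior_depth)))

def total_loss(l_color, l_sem, l_depth, w_sem=0.1, w_depth=0.1):
    """Hypothetical weighted sum; the weights are placeholders, not paper values."""
    return l_color + w_sem * l_sem + w_depth * l_depth

# Toy inputs standing in for one rendered view.
rendered_rgb = np.full((2, 2, 3), 0.5)
target_rgb = np.full((2, 2, 3), 0.25)
logits = np.zeros((4, 3))              # 4 pixels, 3 identity classes
labels = np.array([0, 1, 2, 1])        # per-pixel mask labels
rendered_depth = np.array([[1.0, 2.0], [3.0, 4.0]])
prior_depth = np.array([[1.0, 2.5], [3.0, 3.5]])

loss = total_loss(
    color_l1(rendered_rgb, target_rgb),
    semantic_cross_entropy(logits, labels),
    depth_l1(rendered_depth, prior_depth),
)
print(loss)
```

In a real pipeline each term would be computed on differentiable renderer outputs so that gradients flow back to the Gaussian parameters; NumPy is used here only to show the arithmetic.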

Abstract

Recent advances in Gaussian Splatting have enabled both panoptic and interactive segmentation of 3D scenes. However, existing methods often overlook the need to reconstruct specified targets with complex structures from sparse views. To address this issue, we introduce TSGaussian, a novel framework that combines semantic constraints with depth priors to avoid geometry degradation in challenging novel view synthesis tasks. Our approach prioritizes computational resources on designated targets while minimizing background allocation. Bounding boxes from YOLOv9 serve as prompts for the Segment Anything Model (SAM) to generate 2D mask predictions, ensuring semantic accuracy and cost efficiency. TSGaussian effectively clusters 3D Gaussians by introducing a compact identity encoding for each Gaussian ellipsoid and incorporating 3D spatial consistency regularization. Leveraging these modules, we propose a pruning strategy to effectively reduce redundancy in 3D Gaussians. Extensive experiments demonstrate that TSGaussian outperforms state-of-the-art methods on three standard datasets and a new challenging dataset we collected, achieving superior results in novel view synthesis of specific objects. Code is available at: https://github.com/leon2000-ai/TSGaussian.
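To make the idea of identity-based pruning more concrete, the following minimal sketch shows one way per-Gaussian identity encodings could be used to discard non-target Gaussians. It assumes a linear classifier over the encodings followed by a softmax, and a probability threshold `min_prob`; these choices, along with the names `prune_by_identity` and `target_ids`, are hypothetical and are not claimed to match the paper's actual pruning rule.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def prune_by_identity(identity, classifier, target_ids, min_prob=0.5):
    """Return a boolean keep-mask over Gaussians.

    identity:   (N, D) identity encoding per Gaussian
    classifier: (D, K) linear map from encoding to K identity classes (assumed)
    target_ids: class indices that belong to the designated target
    min_prob:   hypothetical threshold on the summed target probability
    """
    probs = softmax(identity @ classifier, axis=1)          # (N, K)
    target_prob = probs[:, list(target_ids)].sum(axis=1)    # (N,)
    return target_prob >= min_prob

# Toy example: 2-D encodings, class 0 = background, class 1 = target.
identity = np.array([
    [4.0, 0.0],   # strongly background
    [0.0, 4.0],   # strongly target
    [3.0, 0.0],   # background
    [0.0, 0.1],   # weakly target
])
classifier = np.eye(2)

keep = prune_by_identity(identity, classifier, target_ids=[1])
pruned_identity = identity[keep]   # the same mask would index positions, scales, etc.
print(keep, pruned_identity.shape)
```

The same boolean mask would be applied to every per-Gaussian attribute (position, covariance, opacity, color) so that the arrays stay aligned after pruning.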

Paper Structure

This paper contains 15 sections, 11 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Our framework first takes a 360° sparse image sequence as input, using YOLOv9 and SAM to obtain target masks and a depth estimator to generate depth maps. Next, a general tracking model aligns the identity masks across frames. The framework then randomly initializes a set of Gaussians and optimizes the Gaussian field using a 2D identity loss, a 3D regularization loss, semantic control, and pruning, while applying depth regularization. The final Gaussian field enables depth-accurate and semantically rich target view synthesis.
  • Figure 2: SAM-based panoramic segmentation can recognize common scenes, while SAM with prompts can easily extend to custom scenes and provide more complete masks.
  • Figure 3: Training supervised only by 2D views pays limited attention to depth errors, which can lead to inaccurate geometry when training on sparse views. We combine global and local depth regularization to reduce artifacts and obtain a model with more accurate depth.
  • Figure 4: Results for 3D reconstruction of specific semantic targets under sparse views. TSGaussian generates high-quality novel views of specific targets while preserving fine model details.
  • Figure 5: A comparison of view segmentation between Gaussian Grouping and our proposed method in rendering. The masks predicted by Gaussian Grouping exhibit significant errors due to geometric degradation caused by sparse views, resulting in occlusion by artifacts. In contrast, our method, enhanced by semantic constraints and depth regularization, substantially reduces these artifacts. The identity encoding features in the bottom row are visualized using Principal Component Analysis (PCA).
  • ...and 1 more figure
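The combination of global and local depth regularization mentioned in the Figure 3 caption can be sketched as follows. This is a minimal illustration that assumes a scale-and-shift-invariant comparison (mean/std normalization) applied once to the whole depth map and once per non-overlapping patch; the normalization, the squared-error form, the patch size, and the function names are assumptions for illustration, not the paper's exact losses.

```python
import numpy as np

def normalize(depth, eps=1e-6):
    """Remove scale and shift so relative (monocular) depth can be compared."""
    return (depth - depth.mean()) / (depth.std() + eps)

def global_depth_loss(rendered, prior):
    """Compare depth maps after normalizing over the whole image."""
    return float(np.mean((normalize(rendered) - normalize(prior)) ** 2))

def local_depth_loss(rendered, prior, patch=4):
    """Compare depth maps after normalizing each non-overlapping patch.

    Rows and columns that do not fill a complete patch are ignored.
    """
    h, w = rendered.shape
    losses = []
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            r = rendered[y:y + patch, x:x + patch]
            p = prior[y:y + patch, x:x + patch]
            losses.append(np.mean((normalize(r) - normalize(p)) ** 2))
    return float(np.mean(losses))

rng = np.random.default_rng(0)
prior = rng.uniform(1.0, 5.0, size=(8, 8))   # stand-in for a monocular depth prior

# A rendering that differs from the prior only by scale and shift is not penalized.
consistent = 2.0 * prior + 3.0

# Flipping one patch vertically changes its local structure.
distorted = prior.copy()
distorted[:4, :4] = distorted[:4, :4][::-1]

print(global_depth_loss(consistent, prior), local_depth_loss(consistent, prior))
print(global_depth_loss(distorted, prior), local_depth_loss(distorted, prior))
```

The global term constrains overall scene layout, while the per-patch term is sensitive to small structural errors that a whole-image normalization can wash out; how the two are weighted against each other is left open here.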