T-3DGS: Removing Transient Objects for 3D Scene Reconstruction

Alexander Markin, Vadim Pryadilshchikov, Artem Komarichev, Ruslan Rakhimov, Peter Wonka, Evgeny Burnaev

TL;DR

T-3DGS tackles the challenge of removing transient objects from video to enable high-fidelity 3D scene reconstruction with Gaussian Splatting. It introduces an unsupervised Reconstruction Uncertainty Predictor (RUP) that uses semantic features from DINOv2 and a bivariate residual model to identify transient regions, complemented by a KL-divergence-based regularization for robust mask generation. A Segmentation- and SAM-based Transient Mask Refinement (TMR) pipeline propagates and refines masks across frames to handle semi-transient objects, while depth-aware regularization reduces artifacts near boundaries. The combination of these components yields artifact-free reconstructions with improved temporal coherence and boundary accuracy, outperforming prior methods on both sparsely and densely captured datasets. The approach advances robust 3D scene reconstruction in uncontrolled real-world settings and provides a dataset and evaluation framework for semi-transient scenarios.

Abstract

Transient objects in video sequences can significantly degrade the quality of 3D scene reconstructions. To address this challenge, we propose T-3DGS, a novel framework that robustly filters out transient distractors during 3D reconstruction using Gaussian Splatting. Our framework consists of two steps. First, we employ an unsupervised classification network that distinguishes transient objects from static scene elements by leveraging their distinct training dynamics within the reconstruction process. Second, we refine these initial detections by integrating an off-the-shelf segmentation method with a bidirectional tracking module, which together enhance boundary accuracy and temporal coherence. Evaluations on both sparsely and densely captured video datasets demonstrate that T-3DGS significantly outperforms state-of-the-art approaches, enabling high-fidelity 3D reconstructions in challenging, real-world scenarios.

Paper Structure

This paper contains 28 sections, 13 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Overview of the Proposed T-3DGS Architecture. We introduce a modified version of 3D Gaussian Splatting, incorporating a masked loss term $\mathcal{L}_{\text{masked}}$ as described in Eq. \ref{masked_loss}. In each iteration, we start by rendering a reconstruction of a randomly sampled reference image. We compute residuals, along with DINOv2 features from both the ground truth and rendered images. These features are then fed to our RUP model to predict the per-pixel covariance matrix for both images. We calculate binary masks based on the divergence of these distributions (as specified in Eq. \ref{divergence_criterion}). Subsequently, we compute the likelihood as described in Eq. \ref{likelihood} and update the parameters of the RUP model via backpropagation, as indicated by the dashed lines. Additionally, for some scenes, we incorporate a SAM-based mask refiner module (TMR), which further enhances the consistency and sharpness of the masks.
  • Figure 2: During the initial stages of reconstruction, RUP predicts high uncertainty in challenging regions such as backgrounds or high-frequency details. However, since RUP relies exclusively on semantic information, calculating the divergence between reference uncertainty $\Sigma$ and reconstructed uncertainty $\hat{\Sigma}$ effectively suppresses these artifacts. Areas with divergence values above the threshold are highlighted in red, while the final predicted transient mask by RUP is shown in green.
  • Figure 3: Qualitative results on the On-the-go dataset. Our method outperforms existing approaches in detecting transient objects. Predicted transient masks are shown in green.
  • Figure 4: Qualitative results on the T-3DGS dataset. Our method produces cleaner transient masks and further refines them using the TMR module.
  • Figure 5: Qualitative results on the On-the-go dataset using the training frames. Our method produces higher-quality renderings without artifacts.
  • ...and 3 more figures
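
The masking step described in the Figure 1 caption can be illustrated with a minimal sketch. The paper's equations are not reproduced in this excerpt, so everything below is an assumption made for illustration only: each pixel's bivariate residual distribution is taken to be a zero-mean Gaussian, the divergence is the closed-form KL divergence from the reference covariance $\Sigma$ to the rendered covariance $\hat{\Sigma}$, a pixel is flagged as transient when that divergence exceeds a hypothetical `threshold`, and the masked loss is a plain L1 average over the pixels that remain. The function names `kl_zero_mean_2d`, `transient_mask`, and `masked_l1` are likewise hypothetical and do not come from the authors' code.

```python
import math

def kl_zero_mean_2d(cov_p, cov_q):
    """KL(N(0, cov_p) || N(0, cov_q)) for symmetric 2x2 covariances.

    Each covariance is a tuple (a, b, c) standing for [[a, b], [b, c]].
    Assumption: zero-mean bivariate Gaussians; the paper's exact form may differ.
    """
    a0, b0, c0 = cov_p
    a1, b1, c1 = cov_q
    det_p = a0 * c0 - b0 * b0
    det_q = a1 * c1 - b1 * b1
    # trace(cov_q^{-1} cov_p), using inverse = (1/det_q) * [[c1, -b1], [-b1, a1]]
    trace = (c1 * a0 - 2.0 * b1 * b0 + a1 * c0) / det_q
    return 0.5 * (trace - 2.0 + math.log(det_q / det_p))

def transient_mask(covs_ref, covs_ren, threshold):
    """Flag a pixel as transient when its divergence exceeds the threshold.

    Assumption: a simple per-pixel threshold rule; `threshold` is hypothetical.
    """
    return [kl_zero_mean_2d(p, q) > threshold for p, q in zip(covs_ref, covs_ren)]

def masked_l1(gt, render, mask):
    """Mean absolute error over pixels that are NOT flagged as transient."""
    kept = [abs(g - r) for g, r, m in zip(gt, render, mask) if not m]
    return sum(kept) / len(kept) if kept else 0.0

# Two toy pixels: the first has matching covariances, the second does not.
covs_ref = [(1.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
covs_ren = [(1.0, 0.0, 1.0), (2.0, 0.0, 2.0)]
mask = transient_mask(covs_ref, covs_ren, threshold=0.1)
loss = masked_l1([0.5, 0.9], [0.4, 0.1], mask)
```

In this toy input only the second pixel is flagged, so the loss is computed from the first pixel alone. In the pipeline described in the caption, the covariances would come from the RUP model applied to DINOv2 features rather than being set by hand.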