Interpretable Single-View 3D Gaussian Splatting using Unsupervised Hierarchical Disentangled Representation Learning

Yuyang Zhang, Baao Xie, Hu Zhu, Qi Wang, Huanting Guo, Xin Jin, Wenjun Zeng

TL;DR

3DisGS tackles the interpretability gap in single-view 3D Gaussian Splatting by introducing a hierarchical disentangled representation learning (DRL) framework that discovers coarse- and fine-grained 3D semantics without supervision. It employs a dual-branch reconstruction (geometry via point clouds and appearance via triplane Gaussians) and DRL-based encoder-adapters to create orthogonal latent factors, aided by mutual information and style-guided modules to ensure view-consistent reconstructions. Experiments on ShapeNet and CO3D demonstrate effective 3D disentanglement with competitive reconstruction quality and efficiency, enabling semantic edits at both the geometry and appearance levels. This work paves the way for controllable, semantically aware 3D reconstructions from a single view, with potential extensions to environmental effects modeling.

Abstract

Gaussian Splatting (GS) has recently marked a significant advancement in 3D reconstruction, delivering both rapid rendering and high-quality results. However, existing 3DGS methods make it difficult to understand the underlying 3D semantics, which hinders model controllability and interpretability. To address this, we propose an interpretable single-view 3DGS framework, termed 3DisGS, to discover both coarse- and fine-grained 3D semantics via hierarchical disentangled representation learning (DRL). Specifically, the model employs a dual-branch architecture, consisting of a point cloud initialization branch and a triplane-Gaussian generation branch, to achieve coarse-grained disentanglement by separating 3D geometry and visual appearance features. Subsequently, fine-grained semantic representations within each modality are further discovered through DRL-based encoder-adapters. To our knowledge, this is the first work to achieve unsupervised interpretable 3DGS. Evaluations indicate that our model achieves 3D disentanglement while preserving high-quality and rapid reconstruction.

Paper Structure

This paper contains 26 sections, 18 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Comparison of (a) conventional 3DGS and (b) the proposed 3DisGS. Conventional models are inherently non-interpretable, which limits 3D editing to the pixel level and makes it rely heavily on extra priors (masks, bounding boxes, etc.). In contrast, 3DisGS employs hierarchical DRL to achieve interpretable 3D reconstruction without supervision, which enables attribute manipulation at the semantic level.
  • Figure 2: Overview of 3DisGS. Given a single-view image $I$, a pretrained DINO-ViT extracts rich features, which are then compressed into a compact, disentangled latent code $z_{\text{disen}}$ by a DRL-based encoder. DRL-based adapters convert this interpretable code into modality-specific forms, which are fed to two branches. The geometry branch generates point clouds, which serve as the initialization for the appearance branch to produce a triplane $T_{\text{init}}$. The triplane features are then decoded into 3D Gaussians. To improve reconstruction and disentanglement, a mutual information loss $\mathcal{L}_{\text{MI}}$ is applied between $z_{\text{disen}}$ and the reconstructed outputs.
  • Figure 3: Interpretable 3D reconstruction results. In (a), the left three columns present the results of single-view reconstruction on ShapeNet cars and chairs, while the subsequent four columns showcase fine-grained disentanglement of geometric attributes, including roofline and body straightness for cars, as well as armrest height and leg thickness for chairs. (b) demonstrates 3D disentanglement results on the visual appearance attributes including grayscale, body color and local color.
  • Figure 4: Qualitative comparison results. 3DisGS surpasses the baselines in 3D disentanglement, as it can manipulate the attributes while maintaining the integrity of irrelevant representations.
  • Figure 5: Ablation study on the Mutual Information (MI) Loss. The absence of the MI loss leads to observable artifacts.
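
The attribute manipulation described in the captions for Figures 3 and 4 (changing one attribute while leaving unrelated ones intact) is commonly probed in disentanglement work with a latent traversal: one coordinate of the latent code is swept over a range while the others stay fixed. The sketch below illustrates only that idea and is not the paper's implementation. The function names, the three-dimensional code, and `toy_decoder` (a linear stand-in for the geometry and appearance branches) are all hypothetical assumptions.

```python
from typing import Callable, Dict, List, Sequence


def traverse_latent(z: Sequence[float], dim: int, values: Sequence[float]) -> List[List[float]]:
    """Return copies of z in which only coordinate `dim` takes each given value."""
    if not 0 <= dim < len(z):
        raise IndexError(f"dim {dim} is out of range for a code of length {len(z)}")
    codes = []
    for value in values:
        z_new = list(z)      # copy, so the original code is never mutated
        z_new[dim] = value
        codes.append(z_new)
    return codes


def toy_decoder(z: Sequence[float]) -> Dict[str, float]:
    """Hypothetical stand-in for the two branches: each latent coordinate
    drives exactly one named attribute (an idealised, fully disentangled case)."""
    names = ("roofline", "body_straightness", "body_color")
    return {name: 2.0 * zi for name, zi in zip(names, z)}


def changed_attributes(
    decoder: Callable[[Sequence[float]], Dict[str, float]],
    codes: Sequence[Sequence[float]],
    tol: float = 1e-9,
) -> List[str]:
    """List the output attributes that vary across the traversed codes."""
    base = decoder(codes[0])
    changed = set()
    for code in codes[1:]:
        out = decoder(code)
        for key, value in out.items():
            if abs(value - base[key]) > tol:
                changed.add(key)
    return sorted(changed)


# Assumed example code; real codes would come from the DRL-based encoder.
z_disen = [0.1, -0.3, 0.5]
codes = traverse_latent(z_disen, dim=0, values=[-1.0, 0.0, 1.0])
print(changed_attributes(toy_decoder, codes))  # prints ['roofline']
```

With a well-disentangled code, sweeping a single coordinate should alter one attribute only, as in this toy case. An entangled representation would instead report several changed attributes for the same sweep.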