GaussianBlock: Building Part-Aware Compositional and Editable 3D Scene by Primitives and Gaussians
Shuyi Jiang, Qihao Zhao, Hossein Rahmani, De Wen Soh, Jun Liu, Na Zhao
TL;DR
GaussianBlock addresses entangled latent representations in neural 3D reconstruction by introducing a semantically aware hybrid representation that couples flexibly editable superquadric primitives with high-fidelity 3D Gaussians. A novel Attention-guided Centering loss derived from 2D priors, complemented by dynamic splitting and fusion, enforces semantic disentanglement of the primitives, while a binding inheritance strategy maintains a tight connection between the Gaussians and their associated primitives. The optimization proceeds in two stages: the first stage minimizes $L_{\text{first}} = L_{\text{rec}} + \gamma L_{\text{AC}}$ to refine the primitives, and the second stage minimizes $L_{\text{second}} = L_{\text{rgb}} + L_{\text{pos}}$ to bind and refine the Gaussians, enabling precise editing without sacrificing fidelity. Empirical results on DTU, Nerfstudio, BlendedMVS, and related benchmarks demonstrate state-of-the-art part-level decomposition and competitive fidelity with direct editability of components, supporting practical, building-block style 3D editing.
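As a rough illustration of how the two stage objectives above combine their terms, the following minimal sketch treats each loss term as an already-computed scalar. The function names and the default value of `gamma` are hypothetical and are not taken from the paper; only the additive structure of the two objectives comes from the text.

```python
def first_stage_loss(l_rec: float, l_ac: float, gamma: float = 0.1) -> float:
    """Stage 1 objective: L_first = L_rec + gamma * L_AC.

    l_rec and l_ac are assumed to be precomputed scalar loss values.
    The default gamma is a placeholder, not a value reported by the authors.
    """
    return l_rec + gamma * l_ac


def second_stage_loss(l_rgb: float, l_pos: float) -> float:
    """Stage 2 objective: L_second = L_rgb + L_pos (unweighted sum, as in the text)."""
    return l_rgb + l_pos


if __name__ == "__main__":
    # Illustrative numbers only; they do not come from any experiment.
    print(first_stage_loss(1.0, 2.0, gamma=0.5))
    print(second_stage_loss(0.25, 0.5))
```

In a real training loop the inputs would be differentiable tensors rather than floats, and the first stage would be run to convergence before the second stage begins.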
Abstract
Recently, with the development of Neural Radiance Fields and Gaussian Splatting, 3D reconstruction techniques have achieved remarkably high fidelity. However, the latent representations learnt by these methods are highly entangled and lack interpretability. In this paper, we propose a novel part-aware compositional reconstruction method, called GaussianBlock, that enables semantically coherent and disentangled representations, allowing for precise and physical editing akin to building blocks while maintaining high fidelity. GaussianBlock introduces a hybrid representation that leverages the advantages of both primitives, known for their flexible actionability and editability, and 3D Gaussians, which excel in reconstruction quality. Specifically, we achieve semantically coherent primitives through a novel attention-guided centering loss derived from 2D semantic priors, complemented by a dynamic splitting and fusion strategy. Furthermore, we hybridize 3D Gaussians with the primitives to refine structural details and enhance fidelity. Additionally, a binding inheritance strategy is employed to strengthen and maintain the connection between the two. Across diverse benchmarks, our reconstructed scenes are shown to be disentangled, compositional, and compact, enabling seamless, direct, and precise editing while maintaining high quality.
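The primitives referred to above are, per the TL;DR, superquadrics. For readers unfamiliar with them, the sketch below evaluates the standard superquadric inside-outside function for an axis-aligned, origin-centered shape. This is the textbook implicit form, offered as background under that assumption; the paper may parameterize its primitives differently (for example as deformable meshes with a pose), and the function and argument names here are hypothetical.

```python
def superquadric_inside_outside(point, scale, eps1, eps2):
    """Standard superquadric inside-outside function F.

    F < 1: point is inside, F == 1: on the surface, F > 1: outside.
    point: (x, y, z) in the primitive's local frame.
    scale: (a1, a2, a3) positive semi-axis lengths.
    eps1, eps2: positive shape exponents (eps1 = eps2 = 1 gives an ellipsoid).
    """
    x, y, z = point
    a1, a2, a3 = scale
    # abs() keeps the fractional powers real for negative coordinates.
    xy_term = (abs(x / a1) ** (2.0 / eps2) + abs(y / a2) ** (2.0 / eps2)) ** (eps2 / eps1)
    z_term = abs(z / a3) ** (2.0 / eps1)
    return xy_term + z_term


if __name__ == "__main__":
    unit = (1.0, 1.0, 1.0)
    # With eps1 = eps2 = 1 this reduces to the unit-sphere equation x^2 + y^2 + z^2.
    print(superquadric_inside_outside((0.0, 0.0, 0.0), unit, 1.0, 1.0))
    print(superquadric_inside_outside((1.0, 0.0, 0.0), unit, 1.0, 1.0))
    print(superquadric_inside_outside((2.0, 0.0, 0.0), unit, 1.0, 1.0))
```

Because a handful of scale and shape parameters fully determine each primitive, editing a part amounts to changing those parameters, which is the sense in which primitives are described as easy to manipulate.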
