AlignGS: Aligning Geometry and Semantics for Robust Indoor Reconstruction from Sparse Views
Yijie Gao, Houqiang Zhong, Tianchi Zhu, Zhengxue Cheng, Qiang Hu, Li Song
TL;DR
This work tackles indoor semantic 3D reconstruction from sparse views by making semantics an active regularizer for geometry. It introduces AlignGS, which initializes a Gaussian-based scene with an SfM-free VGGT pipeline and jointly optimizes geometric and per-primitive semantic features using differentiable rendering, guided by semantic priors distilled from 2D foundation models. Key innovations include depth-consistency and boundary-aware normal regularization, plus a dual-semantic distillation mechanism from a DINOv2 teacher, which together yield robust, coherent surfaces and improved novel-view synthesis. Evaluations on ScanNet and NRGBD demonstrate state-of-the-art performance in both rendering quality and geometric fidelity, with ablations confirming the additive benefit of each semantic-guided component. The approach enables reliable semantic indoor reconstructions from sparse inputs and supports downstream tasks such as object editing and semantic-aware digital-twin creation.
Abstract
The demand for semantically rich 3D models of indoor scenes is rapidly growing, driven by applications in augmented reality, virtual reality, and robotics. However, creating them from sparse views remains a challenge due to geometric ambiguity. Existing methods often treat semantics as a passive feature painted on an already-formed, and potentially flawed, geometry. We posit that for robust sparse-view reconstruction, semantic understanding should instead be an active, guiding force. This paper introduces AlignGS, a novel framework that actualizes this vision by pioneering a synergistic, end-to-end optimization of geometry and semantics. Our method distills rich priors from 2D foundation models and uses them to directly regularize the 3D representation through a set of novel semantic-to-geometry guidance mechanisms, including depth consistency and multi-faceted normal regularization. Extensive evaluations on standard benchmarks demonstrate that our approach achieves state-of-the-art results in novel view synthesis and produces reconstructions with superior geometric accuracy. The results validate that leveraging semantic priors as a geometric regularizer leads to more coherent and complete 3D models from limited input views. Our code is available at https://github.com/MediaX-SJTU/AlignGS.
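The abstract does not spell out how priors are distilled from the 2D foundation model. As a rough illustration only, one common way to distill per-pixel features is a cosine-distance loss between features rendered from the 3D representation and the teacher's features at the same pixels. The function name, the list-of-vectors input format, and the choice of cosine distance below are assumptions made for this sketch, not details taken from the paper.

```python
import math

def cosine_distillation_loss(rendered, teacher, eps=1e-8):
    """Mean (1 - cosine similarity) over paired feature vectors.

    Hypothetical helper for illustration; not the authors' loss.
    rendered: feature vectors rendered from the 3D scene, one per pixel.
    teacher:  matching feature vectors from a 2D teacher model.
    """
    if len(rendered) != len(teacher) or not rendered:
        raise ValueError("inputs must be non-empty and the same length")
    total = 0.0
    for r, t in zip(rendered, teacher):
        dot = sum(a * b for a, b in zip(r, t))
        norm_r = math.sqrt(sum(a * a for a in r))
        norm_t = math.sqrt(sum(b * b for b in t))
        # eps guards against division by zero for all-zero features
        total += 1.0 - dot / (norm_r * norm_t + eps)
    return total / len(rendered)
```

Under this sketch the loss is near 0 when the rendered and teacher features point in the same direction, 1 when they are orthogonal, and near 2 when they are opposite, so minimizing it pulls the rendered features toward the teacher's regardless of feature magnitude.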
