
UniSem: Generalizable Semantic 3D Reconstruction from Sparse Unposed Images

Guibiao Liao, Qian Ren, Kaimin Liao, Hua Wang, Zhi Chen, Luchao Wang, Yaohua Tang

Abstract

Semantic-aware 3D reconstruction from sparse, unposed images remains challenging for feed-forward 3D Gaussian Splatting (3DGS). Existing methods often predict an over-complete set of Gaussian primitives under sparse-view supervision, leading to unstable geometry and inferior depth quality. Meanwhile, they rely solely on 2D segmenter features for semantic lifting, which provides only weak 3D-level supervision that generalizes poorly, resulting in incomplete 3D semantics in novel scenes. To address these issues, we propose UniSem, a unified framework that jointly improves depth accuracy and semantic generalization via two key components. First, Error-aware Gaussian Dropout (EGD) performs error-guided capacity control by suppressing redundancy-prone Gaussians using rendering error cues, producing meaningful, geometrically stable Gaussian representations for improved depth estimation. Second, we introduce a Mix-training Curriculum (MTC) that progressively blends 2D segmenter-lifted semantics with the model's own emergent 3D semantic priors, implemented with object-level prototype alignment to enhance semantic coherence and completeness. Extensive experiments on ScanNet and Replica show that UniSem achieves superior performance in depth prediction and open-vocabulary 3D segmentation across varying numbers of input views. Notably, with 16-view inputs, UniSem reduces depth Rel by 15.2% and improves open-vocabulary segmentation mAcc by 3.7% over strong baselines.
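
For readers unfamiliar with the two headline metrics, the following is a minimal sketch of how they are commonly computed. It assumes that "Rel" denotes the absolute relative depth error and that "mAcc" denotes accuracy averaged over ground-truth classes; the abstract does not spell out either definition, and the function names `abs_rel` and `mean_class_accuracy` are illustrative only.

```python
def abs_rel(pred, gt):
    """Mean of |pred - gt| / gt over pixels with valid (positive) ground truth."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)


def mean_class_accuracy(pred_labels, gt_labels):
    """Average of per-class accuracy over the classes present in the ground truth."""
    correct, total = {}, {}
    for p, g in zip(pred_labels, gt_labels):
        total[g] = total.get(g, 0) + 1
        correct[g] = correct.get(g, 0) + (p == g)
    return sum(correct[c] / total[c] for c in total) / len(total)
```

Under these assumed definitions, a lower Rel means better depth, while a higher mAcc means better segmentation, and mAcc weights rare and frequent classes equally.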

Paper Structure

This paper contains 14 sections, 18 equations, 9 figures, 10 tables.

Figures (9)

  • Figure 1: UniSem, a feed-forward Gaussian model for 3D semantic-aware reconstruction, predicts a unified 3D Gaussian scene representation from unposed images in a single forward pass, enabling novel view synthesis, depth estimation, and open-vocabulary segmentation. Prior work Uni3R predicts pixel-aligned Gaussians and relies solely on 2D LSeg features for semantic lifting, resulting in noisy depth and limited semantic generalization (golden dashed box). In contrast, UniSem suppresses redundant Gaussians and complements 2D lifting with emergent 3D semantic cues, yielding more accurate depth and more coherent, complete 3D semantics.
  • Figure 2: Visual results under different settings in feed-forward 3DGS. (a) Ground truth. (b) Baseline: redundant Gaussians persist, leading to noisy depth. (c) Removal of a subset of Gaussians in low-error regions (with short re-optimization): appearance changes are minimal while depth noise is reduced. (d) After continued training with redundancy suppression, the model yields more stable and accurate depth.
  • Figure 3: Visual segmentation results of different models.
  • Figure 4: Overview of UniSem. Given a set of unposed input views, a ViT-based encoder extracts multi-view features, which are fed into the decoder and DPT heads to produce pixel-aligned 3D Gaussian primitives. (a) To stabilize depth estimation, we introduce Error-aware Gaussian Dropout (EGD), which computes per-pixel reconstruction error and suppresses overly dense Gaussians via progressive, error-guided dropout, reducing gradient noise during training. (b) To enhance 3D semantic consistency and cross-scene generalization, the proposed Mix-training Curriculum (MTC) transitions from pure 2D segmenter-lifted supervision to mixed 2D-3D semantic training. MTC leverages (i) max-error prompting for cross-view object correspondence, and (ii) view-to-view and geometry-aware prototype alignment, enforcing more coherent semantic Gaussian representations.
  • Figure 5: Depth estimation results on novel views.
  • ...and 4 more figures
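
The captions for Figure 2(c) and Figure 4(a) describe dropping Gaussians preferentially in low-error regions, with a dropout strength that increases progressively. The sketch below illustrates that idea only; it is not the authors' implementation. The linear schedule, the `max_rate` value, the min-max normalization, and the function names `dropout_rate` and `error_guided_keep_mask` are all assumptions introduced for illustration.

```python
import random


def dropout_rate(step, total_steps, max_rate=0.3):
    """Linearly ramp the drop rate from 0 to max_rate (hypothetical schedule)."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return max_rate * frac


def error_guided_keep_mask(errors, rate, rng):
    """Return a keep/drop flag for each pixel-aligned Gaussian.

    errors: per-pixel rendering errors, one per Gaussian (non-negative floats).
    rate:   drop probability applied to the lowest-error Gaussians.
    rng:    a random.Random instance, passed in for reproducibility.
    """
    lo, hi = min(errors), max(errors)
    span = hi - lo
    mask = []
    for e in errors:
        # Min-max normalize to [0, 1]; a constant error map normalizes to 0.
        norm = (e - lo) / span if span > 0 else 0.0
        # Assumed rule: low error -> likely redundant -> higher drop probability.
        drop_prob = rate * (1.0 - norm)
        mask.append(rng.random() >= drop_prob)
    return mask
```

In a training loop, one would presumably compute `rate = dropout_rate(step, total_steps)` each iteration and render only the Gaussians whose flag is true, so that the highest-error Gaussians are always kept.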