
GaussianSSC: Triplane-Guided Directional Gaussian Fields for 3D Semantic Completion

Ruiqi Xian, Jing Liang, He Yin, Xuewei Qi, Dinesh Manocha

Abstract

We present \emph{GaussianSSC}, a two-stage, grid-native and triplane-guided approach to semantic scene completion (SSC) that injects the benefits of Gaussians without replacing the voxel grid or maintaining a separate Gaussian set. We introduce \emph{Gaussian Anchoring}, a sub-pixel, Gaussian-weighted image aggregation over fused FPN features that tightens voxel--image alignment and improves monocular occupancy estimation. We further convert point-like voxel features into a learned per-voxel Gaussian field and refine triplane features via a triplane-aligned \emph{Gaussian--Triplane Refinement} module that combines \emph{local gathering} (target-centric) and \emph{global aggregation} (source-centric). This directional, anisotropic support captures surface tangency, scale, and occlusion-aware asymmetry while preserving the efficiency of triplane representations. On SemanticKITTI~\cite{behley2019semantickitti}, GaussianSSC improves Stage~1 occupancy by +1.0\% Recall, +2.0\% Precision, and +1.8\% IoU over state-of-the-art baselines, and improves Stage~2 semantic prediction by +1.8\% IoU and +0.8\% mIoU.
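To make the idea of a sub-pixel, Gaussian-weighted image aggregation concrete, the following is a minimal sketch, not the authors' implementation. It assumes an isotropic Gaussian with a fixed `sigma`, a square window of half-width `radius`, and a feature map stored as nested lists; the function name `gaussian_anchor` and all of its parameters are hypothetical. The abstract does not specify how the Gaussian parameters are predicted per voxel, so they are plain inputs here.

```python
import math


def gaussian_anchor(feature_map, center_uv, sigma=1.0, radius=2):
    """Gaussian-weighted average of feature vectors around a sub-pixel center.

    feature_map: H x W x C nested lists (a stand-in for a fused FPN map).
    center_uv:   (u, v) sub-pixel location; u runs along width, v along height.
    sigma, radius: assumed fixed here; a learned model would predict them.
    """
    height, width = len(feature_map), len(feature_map[0])
    channels = len(feature_map[0][0])
    u, v = center_uv
    acc = [0.0] * channels
    total = 0.0
    # Visit only the pixels of the window that fall inside the image.
    y_lo, y_hi = max(0, math.floor(v) - radius), min(height, math.floor(v) + radius + 1)
    x_lo, x_hi = max(0, math.floor(u) - radius), min(width, math.floor(u) + radius + 1)
    for y in range(y_lo, y_hi):
        for x in range(x_lo, x_hi):
            # The weight depends on the distance to the sub-pixel center,
            # not to the nearest integer pixel.
            w = math.exp(-((x - u) ** 2 + (y - v) ** 2) / (2.0 * sigma ** 2))
            total += w
            for c in range(channels):
                acc[c] += w * feature_map[y][x][c]
    if total == 0.0:
        return acc  # the center projects outside the image: no evidence
    return [a / total for a in acc]
```

Because the weights are normalized, a constant feature map is reproduced exactly, and shifting `center_uv` by a fraction of a pixel shifts the aggregated feature smoothly rather than snapping to the nearest pixel.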

Paper Structure (30 sections, 22 equations, 5 figures, 5 tables)

Figures (5)

  • Figure 1: Our GaussianSSC takes a monocular input, learns semantics efficiently via a triplane representation, then refines these features with Gaussian primitives to capture geometry and fine details, before fusing into a voxel space for 3D semantic scene completion.
  • Figure 2: Overview of GaussianSSC, a two‑stage pipeline for semantic scene completion. In Stage 1, it predicts an occupancy map from a monocular image, which serves as a structural prior for Stage 2. In Stage 2, we instantiate Gaussian embeddings at voxel locations, gate them by the occupancy priors, and condition three orthogonal triplanes on multi‑scale image features. We then decode a 3D Gaussian per voxel and splat it into the triplanes, performing Gaussian–Triplane refinement to produce stronger semantic features (see Figure~\ref{fig:stage_1_triplane}). The refined triplane features are lifted and merged back into voxel space, where a semantic head predicts the final dense semantic map.
  • Figure 3: Illustration of Stage 1: From a monocular image, we build a query‑conditioned triplane to obtain voxel descriptors, then apply Gaussian Anchoring (per‑voxel Gaussian windowing on a fused FPN map) to gather sub‑pixel image evidence, fuse it via a gated residual, and predict occupancy with a lightweight 3D head.
  • Figure 4: Illustration of Gaussian-Triplane Refinement: We associate each voxel with a Gaussian centered at the voxel and project this Gaussian onto the three orthogonal triplanes. On each plane, we refine the feature at the projected mean with two complementary steps: (i) local gathering, which anchors the feature by aggregating neighboring evidence within the Gaussian field; and (ii) global aggregation, which shares semantic information from all other locations whose Gaussians cover that point.
  • Figure 5: Visualization results: The figure compares semantic maps generated by different approaches. The reference LiDAR sample is constructed from multiple consecutive LiDAR frames. Blue boxes highlight cases where our method completes the scene more effectively, while red boxes indicate more accurate semantic estimations than other approaches.
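
The projection of a per-voxel 3D Gaussian onto the three orthogonal planes, as described in the captions for Figures 2 and 4, can be illustrated with a small sketch. It relies on a standard fact: the marginal of a 3D Gaussian on an axis-aligned plane keeps the two matching mean components and the matching 2x2 block of the covariance. Whether the paper uses exactly this projection is an assumption; the plane names (`xy`, `xz`, `yz`) and both function names are hypothetical.

```python
import math

# Assumed axis pairs for the three orthogonal planes.
PLANE_AXES = {"xy": (0, 1), "xz": (0, 2), "yz": (1, 2)}


def project_gaussian_to_triplanes(mean, cov):
    """Return {plane: (mean_2d, cov_2d)} by marginalizing a 3D Gaussian.

    mean: length-3 sequence; cov: 3x3 symmetric positive-definite matrix.
    """
    projected = {}
    for name, (a, b) in PLANE_AXES.items():
        mean_2d = (mean[a], mean[b])
        cov_2d = ((cov[a][a], cov[a][b]), (cov[b][a], cov[b][b]))
        projected[name] = (mean_2d, cov_2d)
    return projected


def gaussian_weight_2d(point, mean_2d, cov_2d):
    """Unnormalized anisotropic weight exp(-0.5 * d^T Sigma^{-1} d) on a plane."""
    dx, dy = point[0] - mean_2d[0], point[1] - mean_2d[1]
    (sxx, sxy), (syx, syy) = cov_2d
    det = sxx * syy - sxy * syx
    if det <= 0.0:
        raise ValueError("covariance block must be positive definite")
    # Closed-form 2x2 inverse applied to the offset (squared Mahalanobis distance).
    maha = (syy * dx * dx - (sxy + syx) * dx * dy + sxx * dy * dy) / det
    return math.exp(-0.5 * maha)
```

A weight of this form could serve both refinement directions from the Figure 4 caption: evaluated around a target's own projected mean it selects neighbors for local gathering, and evaluated at a target point under other voxels' Gaussians it indicates which sources cover that point for global aggregation. How the weights are then combined with features is not specified in the captions and is left out here.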