TGSFormer: Scalable Temporal Gaussian Splatting for Embodied Semantic Scene Completion

Rui Qian, Haozhi Cao, Tianchen Deng, Tianxin Hu, Weixiang Guo, Shenghai Yuan, Lihua Xie

TL;DR

Embodied semantic scene completion with Gaussian primitives often suffers from redundant memory and unbounded growth as exploration scales. The proposed TGSFormer maintains a persistent Gaussian memory, uses a Dual Temporal Encoder for confidence-aware temporal fusion, and applies Confidence-aware Voxel Fusion to keep memory compact, enabling scalable, frame-agnostic scene completion. Through a two-stage training regime (monocular pretraining followed by embodied fine-tuning) and extensive ablations, the method achieves state-of-the-art results on both monocular and embodied SSC benchmarks while using markedly fewer primitives. This framework advances memory-efficient, long-horizon 3D perception for embodied agents and provides practical pathways for robust large-scale scene understanding.

Abstract

Embodied 3D Semantic Scene Completion (SSC) infers dense geometry and semantics from continuous egocentric observations. Most existing Gaussian-based methods rely on random initialization of many primitives within predefined spatial bounds, resulting in redundancy and poor scalability to unbounded scenes. A recent depth-guided approach alleviates this issue but remains local, suffering from latency and memory overhead as scale increases. To overcome these challenges, we propose TGSFormer, a scalable Temporal Gaussian Splatting framework for embodied SSC. It maintains a persistent Gaussian memory for temporal prediction, without relying on image coherence or frame caches. For temporal fusion, a Dual Temporal Encoder jointly processes current and historical Gaussian features through confidence-aware cross-attention. Subsequently, a Confidence-aware Voxel Fusion module merges overlapping primitives into voxel-aligned representations, regulating density and maintaining compactness. Extensive experiments demonstrate that TGSFormer achieves state-of-the-art results on both local and embodied SSC benchmarks, offering superior accuracy and scalability with significantly fewer primitives while maintaining consistent long-term scene integrity. The code will be released upon acceptance.
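
To make the idea of merging overlapping primitives per voxel more concrete, the following is a minimal sketch of a confidence-weighted merge. It is not the authors' implementation, which has not been released. The `Primitive` fields, the scalar `confidence` value, the `fuse_by_voxel` helper, and the choice to keep the weighted mean position rather than snapping to the voxel center are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from math import floor
from typing import List, Tuple


@dataclass
class Primitive:
    """Hypothetical, simplified Gaussian primitive (scale and rotation omitted)."""
    mean: Tuple[float, float, float]  # center in world coordinates
    logits: List[float]               # per-class semantic scores
    opacity: float
    confidence: float                 # assumed scalar confidence, higher is more trusted


def fuse_by_voxel(primitives: List[Primitive], voxel_size: float) -> List[Primitive]:
    """Merge all primitives whose centers fall in the same voxel.

    Each attribute of the merged primitive is a confidence-weighted average,
    so the output holds at most one primitive per occupied voxel.
    """
    # Group primitives by the integer index of the voxel containing their center.
    buckets = defaultdict(list)
    for p in primitives:
        key = tuple(floor(c / voxel_size) for c in p.mean)
        buckets[key].append(p)

    fused = []
    for key in sorted(buckets):  # sorted only to make the output order deterministic
        group = buckets[key]
        total = sum(p.confidence for p in group)
        if total > 0:
            weights = [p.confidence / total for p in group]
        else:
            # Fall back to a plain average when no primitive carries confidence.
            weights = [1.0 / len(group)] * len(group)

        mean = tuple(
            sum(w * p.mean[axis] for w, p in zip(weights, group)) for axis in range(3)
        )
        num_classes = len(group[0].logits)
        logits = [
            sum(w * p.logits[c] for w, p in zip(weights, group))
            for c in range(num_classes)
        ]
        opacity = sum(w * p.opacity for w, p in zip(weights, group))
        # Assumption: the merged primitive inherits the highest confidence in its group.
        confidence = max(p.confidence for p in group)
        fused.append(Primitive(mean, logits, opacity, confidence))
    return fused
```

Because the output contains at most one primitive per occupied voxel, the primitive count in this sketch is bounded by the explored volume rather than by the number of frames processed, which is the compactness property the abstract attributes to the fusion step.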

Paper Structure

This paper contains 24 sections, 16 equations, 9 figures, 9 tables.

Figures (9)

  • Figure 1: Overview of embodied scene exploration and refinement. Our TGSFormer consistently expands its understanding of the environment as new views are observed and progressively refines previously seen regions, producing a complete and coherent 3D scene.
  • Figure 2: An overview of our proposed TGSFormer architecture. Our framework first employs parallel image and depth encoders to extract appearance features and geometry priors. These are passed to a Gaussian Lifter (Gs.Lifter) and a Gaussian Encoder (Gs.Encoder) to generate the current set of Gaussian primitives and embeddings. These primitives are then fed into our Dual Temporal Encoder (DTE). The DTE loads historical features queried from the global Gaussian Memory and processes both data streams using two weight-sharing Temporal Encoders. The fused representations are passed to our Confidence-aware Voxel Fusion (CAVF) module, which estimates per-primitive semantic and opacity uncertainty, then performs a confidence-weighted fusion to merge primitives and control density. Finally, an aggregator splats the merged Gaussians into the semantic voxel grid. These primitives are then used to update the global Gaussian Cache.
  • Figure 3: Feature alignment visualization with Principal Component Analysis (PCA). PCA projections of Gaussian features show that our multi-stage objective not only aligns intermediate representations toward the final encoder space, but also makes their distributions more isotropic and semantically organized.
  • Figure 4: Qualitative comparison of monocular prediction results on the Occ-ScanNet and Occ-ScanNet-mini datasets. TGSFormer reconstructs more complete geometry and captures semantics with higher clarity than existing approaches.
  • Figure 5: Global prediction visualization on the EmbodiedOcc-ScanNet-mini dataset. Our TGSFormer framework not only produces high-quality monocular completion but also consistently refines and completes the observed scene through temporal updates.
  • ...and 4 more figures