ST-GS: Vision-Based 3D Semantic Occupancy Prediction with Spatial-Temporal Gaussian Splatting

Xiaoyang Yan, Muleilan Pei, Shaojie Shen

TL;DR

This paper introduces two components for Gaussian-based occupancy prediction: a guidance-informed spatial aggregation strategy, built into a dual-mode attention mechanism, that strengthens spatial interaction in Gaussian representations; and a geometry-aware temporal fusion scheme that uses historical context to improve temporal continuity in scene completion.

Abstract

3D occupancy prediction is critical for comprehensive scene understanding in vision-centric autonomous driving. Recent work has used 3D semantic Gaussians to model occupancy while reducing computational overhead, but these methods remain constrained by insufficient multi-view spatial interaction and limited multi-frame temporal consistency. To overcome these issues, we propose a novel Spatial-Temporal Gaussian Splatting (ST-GS) framework that enhances both spatial and temporal modeling in existing Gaussian-based pipelines. Specifically, we develop a guidance-informed spatial aggregation strategy within a dual-mode attention mechanism to strengthen spatial interaction in Gaussian representations. Furthermore, we introduce a geometry-aware temporal fusion scheme that leverages historical context to improve temporal continuity in scene completion. Extensive experiments on the large-scale nuScenes occupancy prediction benchmark show that the proposed approach not only achieves state-of-the-art performance but also delivers markedly better temporal consistency than existing Gaussian-based methods.
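To make the idea of modeling occupancy with 3D semantic Gaussians concrete, the following is a minimal sketch of how a set of such Gaussians could be aggregated into a labeled voxel grid. It is an illustration under simplifying assumptions, not the ST-GS implementation: the names `SemanticGaussian`, `gaussian_weight`, `splat_to_voxels`, and `empty_threshold` are hypothetical, the Gaussians are axis-aligned (no rotation), and every Gaussian is evaluated at every voxel rather than only in a local neighborhood.

```python
import math
from dataclasses import dataclass


@dataclass
class SemanticGaussian:
    """Hypothetical container for one semantic Gaussian (illustrative only)."""
    mean: tuple      # (x, y, z) center in metres
    scale: tuple     # per-axis standard deviations; rotation is omitted for brevity
    opacity: float   # scalar in [0, 1]
    logits: tuple    # unnormalized per-class scores


def gaussian_weight(g, point):
    """Opacity-scaled, unnormalized Gaussian density of `g` at `point`."""
    d2 = sum(((p - m) / s) ** 2 for p, m, s in zip(point, g.mean, g.scale))
    return g.opacity * math.exp(-0.5 * d2)


def splat_to_voxels(gaussians, grid_shape, voxel_size, origin,
                    num_classes, empty_threshold=0.05):
    """Map each voxel index (i, j, k) to a class index, or -1 for empty.

    A voxel is treated as empty when the summed Gaussian weight at its
    center falls below `empty_threshold` (an assumed heuristic).
    """
    labels = {}
    for i in range(grid_shape[0]):
        for j in range(grid_shape[1]):
            for k in range(grid_shape[2]):
                center = tuple(
                    origin[axis] + (idx + 0.5) * voxel_size
                    for axis, idx in enumerate((i, j, k))
                )
                scores = [0.0] * num_classes
                total = 0.0
                for g in gaussians:
                    w = gaussian_weight(g, center)
                    total += w
                    for c in range(num_classes):
                        scores[c] += w * g.logits[c]
                if total < empty_threshold:
                    labels[(i, j, k)] = -1
                else:
                    labels[(i, j, k)] = max(range(num_classes),
                                            key=scores.__getitem__)
    return labels


# Two Gaussians of different classes at opposite ends of a 4x1x1 grid.
gaussians = [
    SemanticGaussian((0.5, 0.5, 0.5), (0.3, 0.3, 0.3), 1.0, (1.0, 0.0)),
    SemanticGaussian((3.5, 0.5, 0.5), (0.3, 0.3, 0.3), 1.0, (0.0, 1.0)),
]
labels = splat_to_voxels(gaussians, grid_shape=(4, 1, 1), voxel_size=1.0,
                         origin=(0.0, 0.0, 0.0), num_classes=2)
```

In this toy setup the two end voxels take the class of the Gaussian centered in them, while the two middle voxels fall below the threshold and are marked empty. A practical pipeline would restrict each voxel to nearby Gaussians and use full covariance matrices.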

Paper Structure

This paper contains 29 sections, 18 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Illustration of temporal inconsistency in occupancy prediction. In this example, the side camera views of the ego vehicle are heavily occluded by surrounding vehicles. The baseline method (GaussianFormer) fails to track the same truck (highlighted by the box) and produces discontinuous drivable surface predictions (highlighted by the ellipse) across frames. In contrast, our proposed ST-GS effectively integrates historical information, delivering accurate and consistent semantic occupancy predictions.
  • Figure 2: Overview of our ST-GS architecture, demonstrating how it enhances the existing Gaussian-based occupancy prediction model in multi-view spatial interaction and multi-frame temporal consistency.
  • Figure 3: Feature sampling paradigms of offsets for GGA and VGA.
  • Figure 4: Qualitative comparison of the baseline GaussianFormer, GaussianFormer-2 [huang2025gaussianformer2], and our proposed ST-GS. Visualization results of three-timestamp predictions from two distinct driving sequences show that ST-GS delivers more spatially accurate and temporally consistent semantic occupancy predictions.