SPOT-Occ: Sparse Prototype-guided Transformer for Camera-based 3D Occupancy Prediction

Suzeyu Chen, Leheng Li, Ying-Cong Chen

TL;DR

SPOT-Occ tackles the decoder bottleneck in camera-based 3D occupancy prediction by replacing dense cross-attention with a Sparse Prototype-guided Transformer that selects a compact set of salient voxel features per query. Two components, Deformable Top-$\rho$% Prototype Selection and a denoising training paradigm, ensure stable, object-aware aggregation while reducing the cross-attention complexity from $\mathcal{O}(N_q N_v)$ to $\mathcal{O}(N_q k)$, where $N_q$ is the number of queries, $N_v$ the number of voxel features, and $k=\lceil \rho N_v \rceil$ the number of prototypes selected per query. The model achieves state-of-the-art or competitive accuracy on nuScenes-Occupancy and SemanticKITTI while substantially reducing latency (e.g., 57.6% faster than GaussianFormer-2 on nuScenes), which supports its use in real-time autonomous driving. The framework is end-to-end, and the code is publicly released for reproducibility.
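
As a rough worked example of the complexity claim, consider purely illustrative sizes. The values of $N_q$, $N_v$, and $\rho$ below are assumptions chosen for easy arithmetic, not settings reported for SPOT-Occ, and $\rho$ is read as a fraction:

$$
N_q = 100,\quad N_v = 40{,}000,\quad \rho = 0.05 \;\Rightarrow\; k = \lceil \rho N_v \rceil = 2{,}000
$$

$$
N_q N_v = 4{,}000{,}000 \quad \text{vs.} \quad N_q k = 200{,}000
$$

Under these assumed sizes, the number of query-key pairs in the aggregation step shrinks by a factor of $1/\rho = 20$.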

Abstract

Achieving highly accurate and real-time 3D occupancy prediction from cameras is a critical requirement for the safe and practical deployment of autonomous vehicles. While the shift to sparse 3D representations solves the encoding bottleneck, it creates a new challenge for the decoder: how to efficiently aggregate information from a sparse, non-uniformly distributed set of voxel features without resorting to computationally prohibitive dense attention. In this paper, we propose a novel Prototype-based Sparse Transformer Decoder that replaces this costly interaction with an efficient, two-stage process of guided feature selection and focused aggregation. Our core idea is to make the decoder's attention prototype-guided. We achieve this through a sparse prototype selection mechanism, where each query adaptively identifies a compact set of the most salient voxel features, termed prototypes, for focused feature aggregation. To ensure this dynamic selection is stable and effective, we introduce a complementary denoising paradigm. This approach leverages ground-truth masks to provide explicit guidance, guaranteeing a consistent query-prototype association across decoder layers. Our model, dubbed SPOT-Occ, outperforms previous methods by a significant margin in speed while also improving accuracy. Source code is released at https://github.com/chensuzeyu/SpotOcc.
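
The two-stage idea described in the abstract (select prototypes per query, then aggregate only over them) can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the released implementation: relevance is scored with a plain dot product, whereas the paper's deformable selection rule is not described in this summary, and the function names `select_prototypes` and `prototype_attention` are hypothetical. Note that this toy scoring step still touches every voxel feature, so it only illustrates the reduced cost of the aggregation stage.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def select_prototypes(query, voxels, rho):
    """Stage 1: indices of the k = ceil(rho * N_v) voxel features most similar to the query.

    Assumption: similarity is a plain dot product (hypothetical scoring rule).
    """
    k = math.ceil(rho * len(voxels))
    ranked = sorted(range(len(voxels)), key=lambda j: dot(query, voxels[j]), reverse=True)
    return ranked[:k]

def prototype_attention(query, voxels, rho):
    """Stage 2: scaled dot-product attention restricted to the selected prototypes."""
    idx = select_prototypes(query, voxels, rho)
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, voxels[j]) / scale for j in idx])
    out = [sum(w * voxels[j][c] for w, j in zip(weights, idx)) for c in range(len(query))]
    return out, idx

# Toy data (hypothetical): four 2-D voxel features and one query.
voxels = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
query = [1.0, 0.0]
out, idx = prototype_attention(query, voxels, rho=0.5)
print(idx)  # [0, 2]
```

With `rho=1.0` every voxel feature is selected and the function reduces to ordinary dense attention, which gives a simple sanity check for the sketch.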

Paper Structure (14 sections, 9 equations, 6 figures, 6 tables)

Figures (6)

  • Figure 1: A quantitative benchmark of mIoU and latency on the nuScenes-Occupancy validation set. All latencies are measured on a single NVIDIA RTX 3090 GPU with a batch size of 1, and * denotes results obtained by running the officially released code.
  • Figure 2: The circles in each matrix visualize query-key interactions; blue circles denote computed attention, while gray circles represent masked connections. (a) Dense Attention interacts with all voxel features, leading to prohibitive cubic complexity. (b) Sparse Attention masks empty voxels yet still computes a full-sized attention matrix, limiting efficiency. (c) Our Sparse Prototype Selection directly selects a compact set of salient features (prototypes) for each query, dramatically reducing complexity for efficient aggregation.
  • Figure 3: Overview of the proposed SPOT-Occ framework. (a) An image backbone [he2016deep] extracts multi-scale 2D features that are lifted to a sparse 3D space by LSS [philion2020lift] and refined with a sparse convolutional backbone [tang2024sparseocc]. (b) The decoder in our Sparse Prototype-guided Transformer Head refines queries via Sparse Prototype Selection under dual supervision. This includes a Denoising Head that leverages noised queries during training only, adding no overhead at inference time.
  • Figure 4: Cross-Attention of Sparse Prototype-guided Transformer Decoder (SPOT-CA). To efficiently process large-scale 3D features, our cross-attention mechanism introduces a sparse prototype selection step. Rather than comparing every query to every voxel feature, we first identify the Top-$\rho$% most relevant features for each query. By creating sparse prototypes from only these key features, the decoder can refine the queries with significantly less computational cost while maintaining high performance.
  • Figure 5: This figure illustrates how the same query predicts significantly inconsistent masks across consecutive decoder layers (Layer 5 vs. Layer 6). This observed instability motivates our denoising training strategy to ensure stable optimization.
  • ...and 1 more figures
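
The contrast drawn in Figure 2 between masking (b) and selection (c) can be illustrated with a small standard-library Python sketch. It is an illustration under assumptions rather than the paper's code: the function names, the toy keys, and the `keep` set are all hypothetical. Masked attention still materializes one logit per key, while gathering first produces only as many logits as there are kept keys, yet both give the same output.

```python
import math

def masked_attention(query, keys, keep):
    """Figure 2(b) style: one logit per key; keys outside `keep` are masked to -inf."""
    scale = math.sqrt(len(query))
    logits = [
        sum(q * x for q, x in zip(query, key)) / scale if j in keep else -math.inf
        for j, key in enumerate(keys)
    ]
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]  # masked entries become 0.0
    total = sum(exps)
    out = [sum(e / total * key[c] for e, key in zip(exps, keys)) for c in range(len(query))]
    return out, len(logits)

def gathered_attention(query, keys, keep):
    """Figure 2(c) style: gather the kept keys first, then attend over only those."""
    selected = [keys[j] for j in sorted(keep)]
    return masked_attention(query, selected, set(range(len(selected))))

# Toy data (hypothetical): six 2-D keys, of which this query keeps two.
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 0.0], [0.0, -1.0], [1.0, 1.0]]
query = [1.0, 2.0]
keep = {1, 5}

out_masked, n_masked = masked_attention(query, keys, keep)
out_gathered, n_gathered = gathered_attention(query, keys, keep)
print(n_masked, n_gathered)  # 6 2
```

The two outputs agree because masked keys receive zero attention weight; the difference lies only in how many logits are computed per query.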