SPOT-Occ: Sparse Prototype-guided Transformer for Camera-based 3D Occupancy Prediction
Suzeyu Chen, Leheng Li, Ying-Cong Chen
TL;DR
SPOT-Occ tackles the decoder bottleneck in camera-based 3D occupancy prediction by replacing dense cross-attention with a Sparse Prototype-guided Transformer in which each query selects a compact set of salient voxel features. Two components, Deformable Top-$\rho$% Prototype Selection and a denoising training paradigm, keep the aggregation stable and object-aware while reducing attention complexity from $\mathcal{O}(N_q N_v)$ to $\mathcal{O}(N_q k)$, where $N_q$ is the number of queries, $N_v$ is the number of voxel features, and $k=\lceil \rho N_v \rceil$. The model achieves state-of-the-art or competitive accuracy on nuScenes-Occupancy and SemanticKITTI while substantially reducing latency (e.g., 57.6% faster than GaussianFormer-2 on nuScenes), which supports its suitability for real-time autonomous driving. The contributions span both architecture and training, and the code is publicly released for reproducibility.
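To make the complexity claim concrete, the following worked count uses illustrative sizes that are assumptions chosen for easy arithmetic, not values reported in the paper: $N_q = 100$, $N_v = 10{,}000$, and $\rho = 0.05$ read as a retained fraction.

```latex
\begin{aligned}
k &= \lceil \rho N_v \rceil = \lceil 0.05 \times 10{,}000 \rceil = 500, \\
N_q N_v &= 100 \times 10{,}000 = 10^{6} \quad \text{(dense query-voxel pairs)}, \\
N_q k &= 100 \times 500 = 5 \times 10^{4} \quad \text{(query-prototype pairs)}, \\
\frac{N_q N_v}{N_q k} &= \frac{N_v}{k} = 20 = \frac{1}{\rho}.
\end{aligned}
```

Under these assumed sizes the number of attended pairs drops by a factor of $1/\rho$. The measured latency gain depends on the full pipeline and need not match this ratio.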
Abstract
Achieving highly accurate and real-time 3D occupancy prediction from cameras is a critical requirement for the safe and practical deployment of autonomous vehicles. While the shift to sparse 3D representations solves the encoding bottleneck, it creates a new challenge for the decoder: how to efficiently aggregate information from a sparse, non-uniformly distributed set of voxel features without resorting to computationally prohibitive dense attention. In this paper, we propose a novel Prototype-based Sparse Transformer Decoder that replaces this costly interaction with an efficient, two-stage process of guided feature selection and focused aggregation. Our core idea is to make the decoder's attention prototype-guided. We achieve this through a sparse prototype selection mechanism, where each query adaptively identifies a compact set of the most salient voxel features, termed prototypes, for focused feature aggregation. To ensure this dynamic selection is stable and effective, we introduce a complementary denoising paradigm. This approach leverages ground-truth masks to provide explicit guidance, guaranteeing a consistent query-prototype association across decoder layers. Our model, dubbed SPOT-Occ, outperforms previous methods by a significant margin in speed while also improving accuracy. Source code is released at https://github.com/chensuzeyu/SpotOcc.
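The two-stage idea of guided selection followed by focused aggregation can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names `select_prototypes` and `sparse_prototype_attention` are hypothetical, prototypes are chosen here by a plain dot-product score (which is itself dense and stands in for the paper's deformable selection), and the denoising training paradigm is not shown.

```python
import math
import numpy as np


def select_prototypes(scores, rho):
    """Return, per query, the indices of the k = ceil(rho * N_v) highest-scoring voxels.

    scores: (N_q, N_v) array of query-voxel relevance scores.
    rho:    retained fraction in (0, 1] (an assumed reading of "top-rho").
    """
    n_v = scores.shape[1]
    k = min(n_v, max(1, math.ceil(rho * n_v)))
    # argpartition puts the k largest scores of each row in the last k columns
    # (in no particular order), which is all that attention needs.
    return np.argpartition(scores, n_v - k, axis=1)[:, n_v - k:]


def sparse_prototype_attention(queries, voxels, rho):
    """Attend from each query to its selected prototypes only.

    queries: (N_q, d), voxels: (N_v, d). Returns (output, prototype_indices).
    """
    d = queries.shape[1]
    # Simplification: dense scoring is used only to pick prototypes in this sketch.
    scores = queries @ voxels.T / math.sqrt(d)           # (N_q, N_v)
    idx = select_prototypes(scores, rho)                 # (N_q, k)

    # Softmax restricted to the k selected scores per query.
    top = np.take_along_axis(scores, idx, axis=1)        # (N_q, k)
    top = top - top.max(axis=1, keepdims=True)           # numerical stability
    weights = np.exp(top)
    weights /= weights.sum(axis=1, keepdims=True)

    prototypes = voxels[idx]                             # (N_q, k, d)
    output = np.einsum("qk,qkd->qd", weights, prototypes)
    return output, idx


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.standard_normal((5, 8))    # 5 queries, feature dim 8 (illustrative sizes)
    v = rng.standard_normal((64, 8))   # 64 voxel features
    out, idx = sparse_prototype_attention(q, v, rho=0.25)
    print(out.shape, idx.shape)
```

In this sketch the aggregation step touches only $N_q k$ query-prototype pairs. With `rho=1.0` every voxel is kept and the result matches ordinary dense softmax attention, which is a convenient sanity check.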
