ProtoOcc: Accurate, Efficient 3D Occupancy Prediction Using Dual Branch Encoder-Prototype Query Decoder

Jungho Kim, Changwon Kang, Dongyoung Lee, Sehwan Choi, Jun Won Choi

TL;DR

ProtoOcc targets accurate 3D occupancy prediction from multi-view imagery with two components: a Dual Branch Encoder (DBE) that fuses voxel and BEV representations, and a Prototype Query Decoder (PQD) that decodes in a single pass. In the DBE, the voxel branch captures fine-grained geometry while the BEV branch provides broad context. The PQD uses Scene-Adaptive and Scene-Agnostic Prototypes as queries, which removes the need for iterative decoding. Robust Prototype Learning injects noise into prototype generation during training so that the model learns to denoise. On Occ3D-nuScenes, ProtoOcc reports state-of-the-art mIoU of 39.56% (single-frame) and 45.02% (multi-frame); the single-frame model runs at 12.83 FPS on an NVIDIA RTX 3090.

Abstract

In this paper, we introduce ProtoOcc, a novel 3D occupancy prediction model designed to predict the occupancy states and semantic classes of 3D voxels through a deep semantic understanding of scenes. ProtoOcc consists of two main components: the Dual Branch Encoder (DBE) and the Prototype Query Decoder (PQD). The DBE produces a new 3D voxel representation by combining 3D voxel and BEV representations across multiple scales through a dual-branch structure. This design enhances both performance and computational efficiency by providing a large receptive field for the BEV representation while maintaining a smaller receptive field for the voxel representation. The PQD introduces Prototype Queries to accelerate the decoding process. Scene-Adaptive Prototypes are derived from the 3D voxel features of the input sample, while Scene-Agnostic Prototypes are computed by accumulating Scene-Adaptive Prototypes with an Exponential Moving Average (EMA) during the training phase. By using these prototype-based queries for decoding, we can directly predict 3D occupancy in a single step, eliminating the need for iterative Transformer decoding. Additionally, we propose Robust Prototype Learning, which injects noise into the prototype generation process and trains the model to denoise during the training phase. ProtoOcc achieves state-of-the-art performance with 45.02% mIoU on the Occ3D-nuScenes benchmark. In the single-frame setting, it reaches 39.56% mIoU with an inference speed of 12.83 FPS on an NVIDIA RTX 3090. Our code can be found at https://github.com/SPA-junghokim/ProtoOcc.
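
The two prototype types described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the flattened `(N, C)` feature layout, the momentum value, and the choice to skip classes absent from a sample are all assumptions made for illustration.

```python
import numpy as np

def scene_adaptive_prototypes(voxel_feats, class_masks):
    """Average the voxel features that fall inside each class mask.

    voxel_feats: (N, C) array, one feature vector per voxel (assumed layout).
    class_masks: (K, N) boolean array, one mask per class.
    Returns (K, C) prototypes and a (K,) flag for classes present in the sample.
    """
    masks = class_masks.astype(voxel_feats.dtype)
    counts = masks.sum(axis=1)                        # voxels per class
    sums = masks @ voxel_feats                        # (K, C) per-class feature sums
    present = counts > 0
    protos = sums / np.maximum(counts, 1.0)[:, None]  # absent classes stay all-zero
    return protos, present

def ema_update(agnostic, adaptive, present, momentum=0.99):
    """Exponential moving average over per-sample prototypes.

    Classes absent from the current sample keep their previous value
    (an assumption of this sketch, not a detail taken from the paper).
    """
    updated = momentum * agnostic + (1.0 - momentum) * adaptive
    return np.where(present[:, None], updated, agnostic)

# Toy sample: 4 voxels, 2 feature channels, 3 classes (the third is absent).
feats = np.array([[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]])
masks = np.array([[True, True, False, False],
                  [False, False, True, True],
                  [False, False, False, False]])
adaptive, present = scene_adaptive_prototypes(feats, masks)
agnostic = ema_update(np.zeros((3, 2)), adaptive, present, momentum=0.9)
```

In this toy case the Scene-Adaptive Prototypes are the per-class feature means of one sample, while the Scene-Agnostic Prototypes move only a fraction `1 - momentum` toward them per update, so they reflect statistics accumulated over many training samples.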

Paper Structure

This paper contains 45 sections, 7 equations, 9 figures, 11 tables.

Figures (9)

  • Figure 1: Comparison of the mIoU and runtime of different methods on the Occ3D-nuScenes validation set. $\star$ indicates results reproduced using publicly available code. Inference time is measured on a single NVIDIA RTX 3090 GPU.
  • Figure 2: Overall structure of ProtoOcc. (a) The Dual Branch Encoder captures fine-grained 3D structures in the voxel domain and models large receptive fields in the BEV domain. (b) The Prototype Query Decoder generates Scene-Aware Queries from prototypes and achieves fast inference without iterative query decoding. (c) The ProtoOcc framework integrates the Dual Branch Encoder and the Prototype Query Decoder for 3D occupancy prediction.
  • Figure 3: Details of Dual Branch Encoder. (a) DBE consists of DFE and HFM. DFE extracts multi-scale features using the dual encoders in the voxel and BEV domain. HFM aggregates these features from low to high scales to generate Comprehensive Voxel Feature $V_{\text{CVF}}$. (b) The Large-Kernel BEV Block comprises a large kernel depth-wise convolution, 1x1 convolutions, and layer normalization.
  • Figure 4: Details of prototype generation. AdaPG generates Scene-Adaptive Prototypes by sampling and averaging the Comprehensive Voxel Feature for each class based on class-specific masks. AgnoPG generates Scene-Agnostic Prototypes by accumulating Scene-Adaptive Prototypes with the EMA method. Finally, Scene-Adaptive Prototypes and Scene-Agnostic Prototypes are combined into Scene-Adaptive Queries.
  • Figure 5: Qualitative results on the Occ3D-nuScenes validation set. The regions marked by red ellipses and rectangles emphasize the superior results generated by our proposed model. The yellow arrow indicates the position and direction of the ego vehicle.
  • ...and 4 more figures
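
The efficiency argument behind the Large-Kernel BEV Block in Figure 3(b) can be made concrete with a small arithmetic sketch. The kernel sizes 7 and 3 are hypothetical values chosen for illustration, not figures taken from the paper, and the helper names are likewise assumptions.

```python
def depthwise_weights_per_channel(kernel_size, dims):
    """Weights per channel of a depth-wise convolution with a square or cubic kernel.

    This is also the number of multiply-accumulates per channel per output location.
    """
    return kernel_size ** dims

def receptive_field(kernel_size, num_layers):
    """Receptive field along one axis after stacking stride-1 convolutions."""
    return 1 + num_layers * (kernel_size - 1)

k_large, k_small = 7, 3  # hypothetical kernel sizes

bev_cost = depthwise_weights_per_channel(k_large, dims=2)          # 2D large kernel on BEV
voxel_large_cost = depthwise_weights_per_channel(k_large, dims=3)  # same kernel in 3D
voxel_small_cost = depthwise_weights_per_channel(k_small, dims=3)  # small 3D kernel

print(bev_cost, voxel_large_cost, voxel_small_cost)      # 49 343 27
print(receptive_field(k_large, 1), receptive_field(k_small, 3))  # 7 7
```

Under these assumed sizes, a 7×7 depth-wise kernel on the BEV plane costs 49 weights per channel, against 343 for a 7×7×7 kernel in the voxel domain. The BEV map also has no height axis, so it has fewer output locations. This is consistent with assigning the large receptive field to the BEV branch and keeping small kernels in the voxel branch.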