3D Occupancy Prediction with Low-Resolution Queries via Prototype-aware View Transformation

Gyeongrok Oh, Sungjune Kim, Heeju Ko, Hyung-gun Chi, Jinkyu Kim, Dongwook Lee, Daehyun Ji, Sungjoon Choi, Sujin Jang, Sangpil Kim

TL;DR

ProtoOcc tackles the challenge of maintaining high-quality 3D occupancy prediction when voxel query resolutions are reduced for real-time deployment. It introduces a prototype-aware view transformation to project high-level 2D image structures onto 3D voxel queries and employs a multi-perspective occupancy decoding strategy to disentangle compressed visual cues. Key contributions include prototype mapping and optimization with contrastive learning, and a multi-perspective decoding framework with scene-consistency regularization. Experimental results on Occ3D-nuScenes and SemanticKITTI demonstrate clear improvements over baselines and competitive performance even with 75% smaller voxel queries, highlighting the method's potential for efficient, robust camera-based 3D perception in real-world systems.

Abstract

The resolution of voxel queries significantly influences the quality of view transformation in camera-based 3D occupancy prediction. However, computational constraints and the practical need for real-time deployment call for smaller query resolutions, which inevitably leads to information loss. It is therefore essential to encode and preserve rich visual details within limited query sizes while ensuring a comprehensive representation of 3D occupancy. To this end, we introduce ProtoOcc, a novel occupancy network that leverages prototypes of clustered image segments in view transformation to enhance low-resolution context. In particular, mapping 2D prototypes onto 3D voxel queries encodes high-level visual geometries and compensates for the spatial information lost at reduced query resolutions. Additionally, we design a multi-perspective decoding strategy to efficiently disentangle the densely compressed visual cues into a high-dimensional 3D occupancy scene. Experimental results on both the Occ3D and SemanticKITTI benchmarks demonstrate the effectiveness of the proposed method, showing clear improvements over the baselines. More importantly, ProtoOcc achieves competitive performance against the baselines even with 75% reduced voxel resolution.
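The idea of "prototypes of clustered image segments" can be illustrated with a minimal sketch. It assumes plain k-means as a stand-in for the clustering step, which the abstract does not specify, and it covers only the 2D side: per-pixel features are grouped, each group is summarized by its mean (the prototype), and every pixel is then tagged with the prototype of its group. The function names `cluster_prototypes` and `prototype_features` are hypothetical, and the mapping onto 3D voxel queries is not shown.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(v[d] for v in vectors) / count for d in range(len(vectors[0]))]


def cluster_prototypes(features, num_prototypes, iterations=10, seed=0):
    """Group per-pixel features with plain k-means (an assumed stand-in).

    Returns (prototypes, assignment): one mean feature per cluster and,
    for each input feature, the index of the cluster it belongs to.
    """
    rng = random.Random(seed)
    prototypes = [list(f) for f in rng.sample(features, num_prototypes)]
    assignment = [0] * len(features)
    for _ in range(iterations):
        # Assign every feature to its nearest prototype.
        assignment = [
            min(range(num_prototypes),
                key=lambda k: squared_distance(f, prototypes[k]))
            for f in features
        ]
        # Recompute each prototype as the mean of its members.
        for k in range(num_prototypes):
            members = [f for f, a in zip(features, assignment) if a == k]
            if members:  # keep the old prototype if a cluster is empty
                prototypes[k] = mean_vector(members)
    return prototypes, assignment


def prototype_features(prototypes, assignment):
    """Give each pixel the prototype of the cluster it was assigned to."""
    return [prototypes[a] for a in assignment]


# Toy usage: four 2-D "pixel features" forming two obvious segments.
toy_features = [[0.0, 0.0], [0.1, 0.0], [10.0, 10.0], [10.1, 10.0]]
toy_prototypes, toy_assignment = cluster_prototypes(toy_features, num_prototypes=2)
toy_per_pixel = prototype_features(toy_prototypes, toy_assignment)
```

In this toy setting each pixel ends up carrying a segment-level summary rather than only its own local feature, which is the kind of high-level cue the abstract describes as complementing low-resolution queries.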

Paper Structure

This paper contains 19 sections, 6 equations, 5 figures, 8 tables.

Figures (5)

  • Figure 1: (a) ProtoOcc performs comparably to higher-resolution counterparts while using 75% less memory. (b-c) Faster inference requires reducing query resolutions in standard view transformation (VT), but this introduces geometric ambiguity. (d) Our prototype-aware VT captures high-level geometric details while preserving computational efficiency.
  • Figure 2: Prototype-aware View Transformation. (a) In the Prototype Mapping stage, we exploit the hierarchies of 2D image features via a clustering method to map 2D prototype representations onto 3D voxel queries. (b) Contrastive learning on the prototype features, based on pseudo ground-truth masks, sharpens the discrimination between prototypes for better feature learning. Best viewed in color.
  • Figure 3: Predictions in challenging scenarios. We compare the predictions of the baseline and ProtoOcc. The corresponding camera view is highlighted in yellow, and important semantic classes are labeled on top. Best viewed in color.
  • Figure 4: Attention map visualization. Compared to the baseline, ProtoOcc attends to more of the important details in the image (e.g., the red dashed boxes), which is crucial for safe driving systems.
  • Figure 5: Feature map visualization. We visualize the learned voxel queries, which are average-pooled on the channel and the z-axis. The red box indicates the region we focus on.
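
The pooling described for Figure 5 can be sketched as follows: voxel queries are averaged over the channel and z axes, leaving a 2D top-down map that can be rendered as an image. This is a minimal illustration, not the authors' visualization code. The `(C, X, Y, Z)` nested-list layout and the function name `bev_feature_map` are assumptions made for the example.

```python
def bev_feature_map(voxel_queries):
    """Average voxel queries over the channel and z axes.

    voxel_queries: nested list indexed as [channel][x][y][z]
                   (this axis order is an assumption for the sketch).
    Returns a nested list indexed as [x][y].
    """
    num_channels = len(voxel_queries)
    size_x = len(voxel_queries[0])
    size_y = len(voxel_queries[0][0])
    size_z = len(voxel_queries[0][0][0])
    cells_per_column = num_channels * size_z
    return [
        [
            sum(
                voxel_queries[c][x][y][z]
                for c in range(num_channels)
                for z in range(size_z)
            ) / cells_per_column
            for y in range(size_y)
        ]
        for x in range(size_x)
    ]


# Toy usage: 2 channels, a 1 x 2 ground-plane grid, 2 height levels.
toy_queries = [
    [[[1.0, 3.0], [5.0, 7.0]]],
    [[[1.0, 3.0], [5.0, 7.0]]],
]
toy_map = bev_feature_map(toy_queries)
```

Each entry of the resulting map summarizes one vertical column of the scene, so regions where the learned queries respond strongly show up as bright areas in the top-down view.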