OctreeOcc: Efficient and Multi-Granularity Occupancy Prediction Using Octree Queries

Yuhang Lu, Xinge Zhu, Tai Wang, Yuexin Ma

TL;DR

This paper tackles the inefficiency of dense voxel-based 3D occupancy prediction by introducing OctreeOcc, a multi-granularity octree-based framework that adaptively partitions space to match object sizes and scene details. It combines semantic-guided initialization with an iterative structure rectification mechanism, and employs deformable attention-based octree encoding to fuse temporal and multi-view features. OctreeOcc achieves state-of-the-art results on Occ3D-nuScenes and SemanticKITTI while reducing computational overhead by 15–24% compared to dense-grid methods. The work demonstrates the practicality of learning octree structures from images for 3D occupancy tasks and provides thorough ablations on initialization, rectification, and depth of the octree. Overall, the approach offers a scalable, accurate solution for holistic 3D scene understanding in autonomous systems.

Abstract

Occupancy prediction has increasingly garnered attention in recent years for its fine-grained understanding of 3D scenes. Traditional approaches typically rely on dense, regular grid representations, which often leads to excessive computational demands and a loss of spatial details for small objects. This paper introduces OctreeOcc, an innovative 3D occupancy prediction framework that leverages the octree representation to adaptively capture valuable information in 3D, offering variable granularity to accommodate object shapes and semantic regions of varying sizes and complexities. In particular, we incorporate image semantic information to improve the accuracy of initial octree structures and design an effective rectification mechanism to refine the octree structure iteratively. Our extensive evaluations show that OctreeOcc not only surpasses state-of-the-art methods in occupancy prediction, but also achieves a 15%-24% reduction in computational overhead compared to dense-grid-based methods.
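The efficiency argument above can be illustrated with a small sketch. The code below is not the OctreeOcc method: the paper learns its octree structure from image features, whereas this example builds a plain octree from an already known label grid by splitting any cube that contains mixed labels. The function name `build_octree_leaves`, the 16×16×16 toy grid, and the label values are all assumptions made for illustration. It shows only how variable granularity reduces the number of cells needed when large regions are uniform.

```python
import numpy as np


def build_octree_leaves(labels, origin=(0, 0, 0), size=None):
    """Recursively split a cubic label grid until every leaf is uniform.

    `labels` must be a cube whose side length is a power of two.
    Returns a list of (origin, size, label) tuples, one per leaf.
    """
    if size is None:
        size = labels.shape[0]
    x, y, z = origin
    block = labels[x:x + size, y:y + size, z:z + size]
    first = block.flat[0]
    # A block with a single label needs no further subdivision.
    # A block of size 1 is always uniform, so the recursion terminates.
    if np.all(block == first):
        return [(origin, size, int(first))]
    half = size // 2
    leaves = []
    # Otherwise split the cube into its eight octants.
    for dx in (0, half):
        for dy in (0, half):
            for dz in (0, half):
                child_origin = (x + dx, y + dy, z + dz)
                leaves += build_octree_leaves(labels, child_origin, half)
    return leaves


# Hypothetical toy scene: label 0 is free space.
grid = np.zeros((16, 16, 16), dtype=np.int64)
grid[0:8, 0:8, 0:8] = 1      # one large object filling a whole octant
grid[12, 12, 12] = 2         # one single-voxel object

leaves = build_octree_leaves(grid)
print(f"dense voxels: {grid.size}, octree leaves: {len(leaves)}")
```

In this toy scene the large object and the empty space are each covered by a few coarse cells, and only the neighbourhood of the single-voxel object is refined down to the finest level. This is the intuition behind matching granularity to object size. The actual savings reported in the paper (15%-24%) come from its learned structure and full pipeline, not from this simplified construction.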

Paper Structure (25 sections, 4 equations, 6 figures, 9 tables)

This paper contains 25 sections, 4 equations, 6 figures, 9 tables.

Figures (6)

  • Figure 1: Scale difference of various categories and octree representation. (a) presents a comparison of the average space occupied by each type of object, indicating different granularities are required for different semantic regions. (b) shows the superiority of octree representations, where we can apply specific granularity not only for different objects but also for different parts of the object, which can reduce computational overhead while preserving spatial information.
  • Figure 2: Overall framework of OctreeOcc. Given multi-view images, we extract multi-scale image features using an image backbone. The initial octree structure is then derived from image segmentation priors, and the dense queries are transformed into octree queries. Next, the octree encoder refines the octree queries and rectifies the octree structure at the same time. Finally, we decode the octree queries to obtain the occupancy prediction for this frame. For better visualisation, the diagram of the Iterative Structure Rectification module shows the octree query and mask in 2D form (quadtree).
  • Figure 3: Illustration of octree structure rectification. The left figure displays the initially predicted octree structure and the right figure depicts the octree structure after Iterative Structure Rectification. It is evident that the predicted octree structure becomes more consistent with the object's shape following the rectification module.
  • Figure 4: Qualitative results on the Occ3D-nuScenes validation set, where the resolution of the voxel predictions is $200\times200\times16$.
  • Figure 5: More visualizations on the Occ3D-nuScenes validation set. The first row displays the input multi-view images, while the second row shows the occupancy predictions of PanoOcc [wang2023panoocc], FBOCC [li2023fb], and our method, alongside the ground truth.
  • ...and 1 more figure