BEV-IO: Enhancing Bird's-Eye-View 3D Detection with Instance Occupancy

Zaibin Zhang, Yuanhang Zhang, Lijun Wang, Yifan Wang, Huchuan Lu

TL;DR

BEV-IO addresses the limited geometric completeness of depth-based BEV lifting by introducing instance occupancy prediction (IOP), which fills in object interiors in frustum space. It couples explicit and implicit occupancy decoders in a two-branch geometry framework with a geometry-aware feature propagation (GFP) module that uses occupancy cues to enrich image features before BEV lifting. The method is reported to outperform state-of-the-art methods on nuScenes while adding only about 0.2% more parameters and about 0.24% more GFLOPs. Occupancy cues therefore offer a low-cost way to obtain more complete BEV representations in camera-based 3D detection.

Abstract

A popular approach for constructing bird's-eye-view (BEV) representations in 3D detection is to lift 2D image features onto the viewing frustum space based on an explicitly predicted depth distribution. However, a depth distribution can only characterize the 3D geometry of visible object surfaces and fails to capture their internal space and overall geometric structure, leading to sparse and unsatisfactory 3D representations. To mitigate this issue, we present BEV-IO, a new 3D detection paradigm that enhances the BEV representation with instance occupancy information. At the core of our method is the newly designed instance occupancy prediction (IOP) module, which aims to infer point-level occupancy status for each instance in the frustum space. To ensure training efficiency while maintaining representational flexibility, it is trained with a combination of explicit and implicit supervision. With the predicted occupancy, we further design a geometry-aware feature propagation (GFP) mechanism, which performs self-attention based on the occupancy distribution along each ray in the frustum and enforces instance-level feature consistency. By integrating the IOP module with the GFP mechanism, our BEV-IO detector is able to render highly informative 3D scene structures with more comprehensive BEV representations. Experimental results demonstrate that BEV-IO can outperform state-of-the-art methods while adding only a negligible increase in parameters (0.2%) and computational overhead (0.24% in GFLOPs).
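
The contrast between depth-only lifting and occupancy-enhanced lifting can be illustrated on a single camera ray. The following is a minimal sketch, not the authors' implementation: the function names `fuse_weights` and `lift_pixel`, the toy values, and the unit fusion coefficients are all assumptions made for illustration. The weighted-sum fusion follows the description in the Figure 3 caption, and the implicit occupancy is replaced by a zero stand-in because it is learned only through the detection loss.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_weights(depth, occ_explicit, occ_implicit, coeffs=(1.0, 1.0, 1.0)):
    """Weighted sum of three per-bin weights; the coefficients are hypothetical."""
    a, b, c = coeffs
    return [a * d + b * e + c * i
            for d, e, i in zip(depth, occ_explicit, occ_implicit)]

def lift_pixel(feature, weights):
    """Outer product: copy a C-dim pixel feature to each depth bin, scaled by its weight."""
    return [[w * f for f in feature] for w in weights]

# Toy ray with 5 depth bins: the visible surface is at bin 1,
# and the object interior spans bins 1 to 3.
depth = softmax([0.0, 6.0, 0.0, 0.0, 0.0])   # peaks at the visible surface only
occ_explicit = [0.0, 1.0, 1.0, 1.0, 0.0]     # covers the whole instance
occ_implicit = [0.0] * 5                     # stand-in for the learned implicit weights
feature = [2.0, -1.0]                        # image feature of this pixel (C = 2)

depth_only = lift_pixel(feature, depth)
depth_occ = lift_pixel(feature, fuse_weights(depth, occ_explicit, occ_implicit))
```

With depth weights alone, bins 2 and 3 behind the surface receive almost no feature mass; with the fused weights, the interior bins carry the pixel feature as well, which is the denser frustum representation the abstract argues for.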

Paper Structure

This paper contains 22 sections, 7 equations, 5 figures, 3 tables, 1 algorithm.

Figures (5)

  • Figure 1: Comparison of our BEV-IO with existing explicit BEV-based detection methods. (a) Existing BEV-based methods use estimated depth weights to lift image features onto the BEV space, where the depth weights capture only visible surfaces. (b) We add occupancy weights on top of the depth weights to obtain more complete and precise BEV representations. In addition, we propagate image features with occupancy cues to obtain geometry-aware image features.
  • Figure 2: Overall pipeline of our BEV-IO. BEV-IO is mainly composed of a 3D geometry branch and a feature propagation branch. (1) The 3D geometry branch takes the image features as input to estimate depth and explicit/implicit instance occupancy weights. These weights are then fused to generate depth-occupancy weights. (2) The feature propagation branch takes image features and explicit instance occupancy weights as input and applies a geometry-aware propagation module to further enhance the image features with geometry cues. Finally, the resulting geometry-aware features are lifted onto the BEV space using the depth-occupancy weights, and the BEV features are fed into the detection head to produce the final detection results.
  • Figure 3: Illustration of our 3D geometry branch. The 3D geometry branch takes image features as input to predict depth, explicit instance occupancy, and implicit instance occupancy weights. Depth and explicit instance occupancy are supervised by ground-truth depth and generated ground-truth explicit occupancy, respectively. The implicit occupancy weights are supervised only by the final detection loss. The depth-occupancy weights are the weighted sum of these three weights.
  • Figure 4: Geometry-aware feature propagation mechanism. The inputs of the geometry-aware feature propagation (GFP) mechanism are the predicted explicit instance occupancy and the image features. GFP takes the explicit occupancy tokens as the key and value, performs self-attention, and conducts geometry-aware feature propagation on the image features.
  • Figure 5: Visualization of detection performance. The predicted and ground truth bounding boxes are marked in yellow and green, respectively. BEV-IO obtains more accurate predictions compared with the baseline method as shown in the red dashed boxes.
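
The propagation idea in the Figure 4 caption can be sketched as attention whose affinities come from occupancy tokens rather than from appearance. This is one possible reading under stated assumptions, not the paper's module: here each pixel's occupancy distribution along its ray serves as its token, affinities are scaled dot products between tokens, and the resulting attention map mixes image features with a residual connection. The names `occupancy_affinity` and `propagate`, the absence of learned projections, and the exact query/key/value roles are all hypothetical simplifications.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def occupancy_affinity(occ_tokens):
    """Row-normalized attention map (N x N) from N occupancy tokens of length D."""
    scale = math.sqrt(len(occ_tokens[0]))
    return [softmax([dot(q, k) / scale for k in occ_tokens]) for q in occ_tokens]

def propagate(features, occ_tokens):
    """Mix N image features (N x C) using occupancy-derived attention, plus a residual."""
    attn = occupancy_affinity(occ_tokens)
    channels = len(features[0])
    out = []
    for i, row in enumerate(attn):
        mixed = [sum(a * f[c] for a, f in zip(row, features)) for c in range(channels)]
        out.append([x + y for x, y in zip(features[i], mixed)])
    return out

# Toy example: pixels 0 and 1 see the same instance (same occupied bins),
# while pixel 2 sees something at a different depth.
occ_tokens = [
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
    [1.0, 0.0, 0.0, 0.0],
]
features = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]

attn = occupancy_affinity(occ_tokens)
enhanced = propagate(features, occ_tokens)
```

In this toy setup, pixels that share an occupancy pattern attend to each other more strongly than to the unrelated pixel, so their features are pulled together, which is the instance-level consistency the abstract attributes to GFP.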