BePo: Dual Representation for 3D Occupancy Prediction

Yunxiao Shi, Hong Cai, Jisoo Jeong, Yinhao Zhu, Shizhong Han, Amin Ansari, Fatih Porikli

Abstract

3D occupancy prediction infers fine-grained 3D geometry and semantics, which is critical for autonomous driving. Most existing approaches carry high compute costs, requiring dense 3D feature volumes and cross-attention to aggregate information effectively. More efficient methods adopt Bird's Eye View (BEV) or sparse points as the scene representation, which substantially reduces runtime. However, BEV struggles with small objects, which often have very limited feature representation, especially after being projected to the ground plane. Sparse points, on the other hand, can model objects of various sizes in 3D space, but are inefficient at capturing flat surfaces or large objects. To address these shortcomings, we present BePo, which features a dual representation of BEV and sparse points. The 3D information learned in the sparse points branch is shared with the BEV stream via cross-attention, which injects learning signals of difficult objects on the BEV plane. The outputs of both branches are then fused to generate the final 3D occupancy predictions. Extensive experiments on a suite of challenging benchmarks including Occ3D-nuScenes, Occ3D-Waymo and Occ-ScanNet demonstrate the superiority of our proposed BePo. In addition, BePo carries low inference cost even when compared to the latest efficient methods.

Paper Structure

This paper contains 21 sections, 14 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Accuracy (mIoU on Occ3D-nuScenes [tian2024occ3d, caesar2020nuscenes]) vs. inference latency (ms) measured on a single NVIDIA A100 GPU. BePo outperforms previous methods while maintaining competitive inference latency.
  • Figure 2: Overview of our proposed BePo. First, an image backbone (e.g., ResNet [he2016deep]) extracts features from the multiple camera images, which are then fed to both the sparse points and BEV branches. Interaction between the features from these two learning streams is enabled via cross-attention. The sparse points branch predicts 3D point locations and class scores, which are voxelized into a volume; we fuse this volume with the predicted volume from the BEV branch to generate the final predicted 3D occupancy.
  • Figure 3: Example qualitative 3D semantic occupancy prediction of BePo on Occ3D-nuScenes validation set. Cons. Veh stands for "Construction Vehicle" and Dri. Sur stands for "Drivable Surface". Both prediction and ground-truth are visualized under BEV. Best viewed in color and zoomed in.
  • Figure 4: Example qualitative comparison of BePo with FlashOcc [yu2023flashocc] and OPUS [wang2024opus] on Occ3D-nuScenes validation set. BePo captures small objects that are represented by only a very limited number of voxels (e.g., a motorcycle at the frame edge, a car at far distance), while both other methods fail to do so.
  • Figure 5: Example qualitative 3D semantic occupancy prediction of BePo on Occ3D-Waymo validation set. Both prediction and ground-truth are visualized under BEV. Best viewed in color and zoomed in.
  • ...and 1 more figures
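
The voxelize-then-fuse step described in the Figure 2 caption can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `voxelize_points` and `fuse_volumes`, the per-class maximum used when several points fall in one voxel, and the fixed blending weight `point_weight` are all assumptions made for illustration, and the source does not specify the fusion operator.

```python
from math import floor

def voxelize_points(points, scores, grid_min, voxel_size, grid_shape):
    """Scatter per-point class scores into voxels, keeping the per-class maximum.

    Assumption: max-pooling of scores within a voxel (not stated in the source).
    """
    volume = {}
    for (x, y, z), s in zip(points, scores):
        idx = tuple(floor((c - m) / voxel_size) for c, m in zip((x, y, z), grid_min))
        if any(i < 0 or i >= n for i, n in zip(idx, grid_shape)):
            continue  # point falls outside the grid
        if idx in volume:
            volume[idx] = [max(a, b) for a, b in zip(volume[idx], s)]
        else:
            volume[idx] = list(s)
    return volume

def fuse_volumes(point_volume, bev_volume, num_classes, point_weight=0.5):
    """Blend the two volumes per voxel and return the argmax class per voxel.

    Assumption: a fixed weighted sum stands in for the unspecified fusion.
    """
    fused = {}
    zeros = [0.0] * num_classes
    for idx in set(point_volume) | set(bev_volume):
        p = point_volume.get(idx, zeros)
        b = bev_volume.get(idx, zeros)
        blended = [point_weight * pi + (1.0 - point_weight) * bi for pi, bi in zip(p, b)]
        fused[idx] = max(range(num_classes), key=blended.__getitem__)
    return fused

# Toy inputs: four predicted points with 3-class scores; the last lies outside the grid.
points = [(0.2, 0.2, 0.1), (0.3, 0.1, 0.2), (1.6, 0.4, 0.1), (9.0, 9.0, 9.0)]
scores = [[0.1, 0.9, 0.0], [0.2, 0.6, 0.2], [0.0, 0.1, 0.9], [1.0, 0.0, 0.0]]
point_volume = voxelize_points(points, scores, (0.0, 0.0, 0.0), 0.5, (4, 4, 2))

# Toy BEV-branch volume, stored sparsely as voxel index -> class scores.
bev_volume = {(0, 0, 0): [0.7, 0.2, 0.1], (1, 1, 0): [0.8, 0.1, 0.1]}
occupancy = fuse_volumes(point_volume, bev_volume, num_classes=3)
```

In this toy case the voxel `(3, 0, 0)` is labeled only because the sparse points branch placed a point there, which mirrors the caption's motivation: objects that leave little trace on the BEV plane can still reach the final prediction through the point volume.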