OccTransformer: Improving BEVFormer for 3D camera-only occupancy prediction

Jian Liu, Sipeng Zhang, Chuixin Kong, Wenyuan Zhang, Yuhang Wu, Yikang Ding, Borun Xu, Ruibo Ming, Donglai Wei, Xianming Liu

TL;DR

The paper addresses camera-only 3D occupancy prediction for autonomous driving using multi-camera nuScenes data. It introduces occTransformer, an enhanced BEVFormer-based framework that adds a 3D U-Net head, data augmentation, a stronger image backbone, and a multi-loss objective, followed by ensembling and detection-guided fusion. The approach achieves 49.23 mIoU on the 3D occupancy prediction track, which indicates the benefit of combining 2D-to-3D lifting, 3D spatial modeling, and cross-model fusion. The largest practical gains for camera-only 3D scene understanding are attributed to ensembling diverse occupancy models and to integrating StreamPETR detections for objects in the scene.
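
The ensembling and detection-guided fusion steps mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the weighted averaging of per-voxel class probabilities, the axis-aligned boxes, and the names `ensemble_argmax` and `paint_boxes` are all assumptions made for illustration, and the report may fuse models and detections differently.

```python
from typing import List, Sequence, Tuple

# Per-voxel class probabilities for one model: [num_voxels][num_classes].
Probs = List[List[float]]
Point = Tuple[float, float, float]
# Hypothetical axis-aligned box: (min corner, max corner, class id).
Box = Tuple[Point, Point, int]


def ensemble_argmax(model_probs: Sequence[Probs], weights: Sequence[float]) -> List[int]:
    """Weighted average of class probabilities across models, then argmax per voxel."""
    num_voxels = len(model_probs[0])
    num_classes = len(model_probs[0][0])
    labels = []
    for v in range(num_voxels):
        fused = [
            sum(w * probs[v][c] for w, probs in zip(weights, model_probs))
            for c in range(num_classes)
        ]
        labels.append(max(range(num_classes), key=fused.__getitem__))
    return labels


def paint_boxes(labels: Sequence[int], centers: Sequence[Point], boxes: Sequence[Box]) -> List[int]:
    """Overwrite the label of every voxel whose center lies inside a detected box."""
    out = list(labels)
    for i, center in enumerate(centers):
        for lo, hi, cls in boxes:
            if all(l <= p <= h for p, l, h in zip(center, lo, hi)):
                out[i] = cls
                break  # first matching box wins (an arbitrary tie-break)
    return out
```

In this sketch the occupancy models vote first and the detector's boxes are applied afterwards, so a detection always overrides the ensemble inside its box. That ordering is a design choice of the sketch, not something the summary specifies.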

Abstract

This technical report presents our solution, "occTransformer", for the 3D occupancy prediction track in the autonomous driving challenge at CVPR 2023. Our method builds upon the strong baseline BEVFormer and improves its performance through several simple yet effective techniques. First, we employed data augmentation to increase the diversity of the training data and improve the model's generalization ability. Second, we used a strong image backbone to extract more informative features from the input data. Third, we incorporated a 3D U-Net head to better capture the spatial information of the scene. Fourth, we added more loss functions to better optimize the model. Additionally, we ensembled our model with the occupancy models BEVDet and SurroundOcc to further improve performance. Most importantly, we integrated the 3D detection model StreamPETR to enhance the model's ability to detect objects in the scene. Using these methods, our solution achieved 49.23 mIoU on the 3D occupancy prediction track in the autonomous driving challenge.
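
For readers unfamiliar with the reported metric, the following is a minimal sketch of mean intersection-over-union (mIoU) over per-voxel class labels. The function name `mean_iou` and the choice to skip classes absent from both prediction and ground truth are assumptions; the challenge's official evaluation may differ, for example in which voxels it counts.

```python
from typing import Sequence


def mean_iou(pred: Sequence[int], target: Sequence[int], num_classes: int) -> float:
    """Average the per-class IoU over classes that appear in pred or target."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # assumption: classes absent from both are skipped
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0
```

The function returns a value in [0, 1]; scores such as 49.23 are conventionally this value expressed as a percentage.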

Paper Structure

This paper contains 11 sections, 1 equation, 2 figures, and 4 tables.

Figures (2)

  • Figure 1: Occupancy results on the test set of the nuScenes 3D occupancy prediction track. BBox denotes the 3D detection results.
  • Figure 2: The occTransformer framework builds on BEVFormer. 2D image features are first extracted from the camera images and then aggregated into a bird's-eye-view (BEV) embedding. A simple decoder then generates 3D voxel features, which are further enriched by a 3D U-Net head. The final output of the framework is a 3D occupancy prediction.
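
The BEV-to-voxel step in Figure 2 can be illustrated with a small shape-level sketch. It assumes the decoder splits each BEV cell's channel vector into `num_z` height slices (a channel-to-height reshape); the caption only says "a simple decoder", so this particular scheme and the name `bev_to_voxels` are hypothetical. The 3D U-Net head that would follow is not sketched.

```python
from typing import List

Bev = List[List[List[float]]]           # [H][W][C]
Voxels = List[List[List[List[float]]]]  # [Z][H][W][C // Z]


def bev_to_voxels(bev: Bev, num_z: int) -> Voxels:
    """Split each BEV cell's C channels into num_z height slices of C // num_z channels."""
    channels = len(bev[0][0])
    if channels % num_z != 0:
        raise ValueError("channel count must be divisible by num_z")
    cv = channels // num_z
    return [
        [[cell[z * cv:(z + 1) * cv] for cell in row] for row in bev]
        for z in range(num_z)
    ]
```

For example, a 200 x 200 BEV grid with 256 channels and `num_z = 16` would yield 16 height levels of 16-channel voxel features; these sizes are illustrative and are not taken from the report.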