OccFusion: Multi-Sensor Fusion Framework for 3D Semantic Occupancy Prediction

Zhenxing Ming, Julie Stephany Berrio, Mao Shan, Stewart Worrall

TL;DR

OccFusion presents a multi-sensor fusion framework that combines surround-view cameras, lidar, and radar to predict dense 3D semantic occupancy. It fuses 2D and 3D features through dynamic fusion modules and global-local attention to generate multi-scale occupancy volumes, achieving robust improvements over camera-only baselines, especially in night and rainy conditions. Experiments on nuScenes and SemanticKITTI demonstrate substantial mIoU gains from multi-sensor fusion, with radar and lidar contributing complementary strengths in range and geometry, respectively. The framework shows favorable convergence behavior, though with higher computational costs, and provides insights into sensor contribution across perception ranges. Overall, OccFusion advances robust 3D scene understanding for autonomous driving by integrating heterogeneous sensing modalities with principled fusion strategies.
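The gains above are reported in mIoU (mean intersection-over-union). As a minimal sketch of how that metric is typically computed over voxel label grids, assuming integer class labels per voxel: the function name `mean_iou`, the `ignore_index` parameter, and the choice to skip classes absent from both prediction and ground truth are illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def mean_iou(pred, target, num_classes, ignore_index=None):
    """Mean intersection-over-union across classes for voxel label grids.

    `pred` and `target` are integer arrays of identical shape.
    Hypothetical helper for illustration; not the paper's evaluation code.
    """
    pred = np.asarray(pred).ravel()
    target = np.asarray(target).ravel()

    # Optionally drop voxels whose ground-truth label should not be scored.
    if ignore_index is not None:
        keep = target != ignore_index
        pred, target = pred[keep], target[keep]

    ious = []
    for c in range(num_classes):
        pred_c = pred == c
        target_c = target == c
        union = np.logical_or(pred_c, target_c).sum()
        if union == 0:
            continue  # class absent from both grids: skip it (assumption)
        inter = np.logical_and(pred_c, target_c).sum()
        ious.append(inter / union)

    return float(np.mean(ious)) if ious else float("nan")


if __name__ == "__main__":
    pred = np.array([0, 0, 1, 1])
    target = np.array([0, 1, 1, 1])
    # Class 0: 1 / 2, class 1: 2 / 3, so the mean is 7 / 12.
    print(mean_iou(pred, target, num_classes=2))
```

Benchmark evaluations usually accumulate intersections and unions over the whole validation set before dividing, rather than averaging per-sample scores as this toy version would.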

Abstract

A comprehensive understanding of 3D scenes is crucial in autonomous vehicles (AVs), and recent models for 3D semantic occupancy prediction have successfully addressed the challenge of describing real-world objects with varied shapes and classes. However, existing methods for 3D occupancy prediction rely heavily on surround-view camera images, making them susceptible to changes in lighting and weather conditions. This paper introduces OccFusion, a novel sensor fusion framework for predicting 3D occupancy. By integrating features from additional sensors, such as lidar and surround-view radars, our framework enhances the accuracy and robustness of occupancy prediction, resulting in top-tier performance on the nuScenes benchmark. Furthermore, extensive experiments conducted on the nuScenes and SemanticKITTI datasets, including challenging night and rainy scenarios, confirm the superior performance of our sensor fusion strategy across various perception ranges. The code for this framework will be made available at https://github.com/DanielMing123/OccFusion.

Paper Structure (29 sections, 4 equations, 7 figures, 8 tables)

Figures (7)

  • Figure 1: Pipelines for two approaches: the purely vision-centric approach (top) and the multi-sensor fusion approach (bottom). We perform 3D semantic occupancy prediction by fusing the feature volumes of three modalities.
  • Figure 2: Overall architecture of OccFusion. First, the surround-view images are fed into the 2D backbone to extract multi-scale features. A view transformation is then applied at each scale to obtain that level's global BEV feature and local 3D feature volume. The 3D point clouds generated by the lidar and the surround-view radars are likewise fed into the 3D backbone to generate multi-scale local 3D feature volumes and global BEV features. At each level, the dynamic fusion 3D/2D modules fuse the features from the cameras with those from the lidar/radar. Each level's merged global BEV feature and local 3D feature volume are then fed into the global-local attention fusion to generate the final 3D volume at that scale. Finally, the 3D volume at each level is upsampled and combined through skip connections, with multi-scale supervision applied.
  • Figure 3: Dynamic Fusion 3D/2D Modules. The upper diagram shows the dynamic fusion 2D module in detail, and the lower diagram shows the dynamic fusion 3D module.
  • Figure 4: Class distribution for three validation sets. (a) whole validation set class distribution, (b) rainy scenario subset class distribution, and (c) night scenario subset class distribution.
  • Figure 5: Performance variation trend for 3D semantic occupancy prediction task. (a) mIoU performance variation trend on the whole nuScenes validation set, (b) mIoU performance variation trend on the nuScenes validation rainy scenario subset, and (c) mIoU performance variation on the nuScenes validation night scenario subset. Better viewed when zoomed in.
  • ...and 2 more figures
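
The last step in the Figure 2 caption, upsampling each level's 3D volume and merging it through skip connections, can be sketched roughly as follows. This is only an illustration under stated assumptions: nearest-neighbour upsampling, a factor of 2 between adjacent levels, equal channel counts at every level, and element-wise addition as the skip connection are all choices made here for simplicity, and the names `upsample2x` and `coarse_to_fine` are hypothetical rather than taken from the OccFusion code.

```python
import numpy as np


def upsample2x(volume):
    """Nearest-neighbour upsampling of a (C, X, Y, Z) volume by 2 along X, Y, Z."""
    for axis in (1, 2, 3):
        volume = np.repeat(volume, 2, axis=axis)
    return volume


def coarse_to_fine(volumes):
    """Merge per-level volumes, ordered coarsest first.

    Each finer level receives the upsampled result of the level before it
    through an additive skip connection (an assumption of this sketch).
    Returns the refined volume at every level, so that each one could be
    supervised separately in a multi-scale setup.
    """
    refined = [volumes[0]]
    for finer in volumes[1:]:
        refined.append(finer + upsample2x(refined[-1]))
    return refined


if __name__ == "__main__":
    # Three hypothetical levels with 4 channels, each twice the resolution of the last.
    levels = [
        np.ones((4, 2, 2, 1)),
        np.ones((4, 4, 4, 2)),
        np.ones((4, 8, 8, 4)),
    ]
    for out in coarse_to_fine(levels):
        print(out.shape)
```

Returning every level's refined volume, rather than only the finest one, mirrors the multi-scale supervision mentioned in the caption: a loss can be attached to each output resolution.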