OccFusion: Depth Estimation Free Multi-sensor Fusion for 3D Occupancy Prediction
Ji Zhang, Yiran Ding, Zixin Liu
TL;DR
OccFusion addresses the robustness and efficiency gaps in 3D occupancy prediction by removing depth estimation from image feature fusion and leveraging a point-to-point LiDAR-camera fusion via deformable attention. It introduces an active coarse-to-fine decoder and a simple, transferable active training strategy to focus learning on hard examples, reducing computation without sacrificing accuracy. Experiments on nuScenes-Occupancy and nuScenes-Occ3D show state-of-the-art or competitive mIoU, with especially large gains for small objects and substantial reductions in computation. The approach offers a practical, generalizable framework for depth-free multi-modal fusion in autonomous driving perception.
Abstract
3D occupancy prediction based on multi-sensor fusion, crucial for a reliable autonomous driving system, enables fine-grained understanding of 3D scenes. Previous fusion-based 3D occupancy prediction methods relied on depth estimation to process 2D image features. However, depth estimation is an ill-posed problem, which limits the accuracy and robustness of these methods. Furthermore, fine-grained occupancy prediction demands extensive computational resources. To address these issues, we propose OccFusion, a depth-estimation-free multi-modal fusion framework. Additionally, we introduce a generalizable active training method and an active decoder that can be applied to any occupancy prediction model, with the potential to enhance their performance. Experiments conducted on nuScenes-Occupancy and nuScenes-Occ3D demonstrate our framework's superior performance. Detailed ablation studies highlight the effectiveness of each proposed method.
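To make the idea of depth-estimation-free fusion concrete, the following is a minimal sketch, not the OccFusion implementation. It assumes a pinhole camera model and replaces the deformable attention mentioned in the TL;DR with a plain nearest-neighbour lookup: each LiDAR point, whose depth is measured rather than estimated, is projected into the image and picks up the image feature at its pixel. The function names `project_points` and `sample_image_features`, and all matrix names, are hypothetical.

```python
import numpy as np


def project_points(points_xyz, intrinsics, cam_from_lidar):
    """Project LiDAR points (N, 3) to pixel coordinates with a pinhole model.

    intrinsics: (3, 3) camera matrix; cam_from_lidar: (4, 4) rigid transform.
    Returns (N, 2) pixel coordinates and a mask of points in front of the camera.
    """
    n = points_xyz.shape[0]
    homog = np.hstack([points_xyz, np.ones((n, 1))])   # (N, 4) homogeneous points
    cam = (cam_from_lidar @ homog.T).T[:, :3]          # (N, 3) in the camera frame
    depth = cam[:, 2]
    in_front = depth > 1e-6
    # Avoid dividing by zero or negative depth; such points are masked out anyway.
    safe_depth = np.where(in_front, depth, 1.0)
    pix = (intrinsics @ cam.T).T                       # (N, 3)
    uv = pix[:, :2] / safe_depth[:, None]
    return uv, in_front


def sample_image_features(feat_map, uv, valid):
    """Nearest-neighbour lookup of image features; feat_map has shape (H, W, C).

    Assumption: a simple stand-in for learned sampling such as deformable attention.
    Points that project outside the image, or that are not valid, get zero features.
    """
    h, w, c = feat_map.shape
    cols = np.rint(uv[:, 0]).astype(int)
    rows = np.rint(uv[:, 1]).astype(int)
    inside = valid & (cols >= 0) & (cols < w) & (rows >= 0) & (rows < h)
    out = np.zeros((uv.shape[0], c), dtype=feat_map.dtype)
    out[inside] = feat_map[rows[inside], cols[inside]]
    return out, inside


if __name__ == "__main__":
    K = np.array([[1.0, 0.0, 2.0], [0.0, 1.0, 1.0], [0.0, 0.0, 1.0]])
    T = np.eye(4)  # assumption: LiDAR and camera frames coincide
    pts = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, -1.0], [10.0, 0.0, 1.0]])
    feats = np.arange(3 * 4 * 2, dtype=float).reshape(3, 4, 2)
    uv, front = project_points(pts, K, T)
    point_feats, used = sample_image_features(feats, uv, front)
    print(point_feats, used)
```

Because the pixel location of each point follows directly from the sensor geometry, no per-pixel depth distribution has to be predicted. A learned sampler would typically attend to several offsets around each projected location instead of a single pixel.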
