A Generalized Multi-Modal Fusion Detection Framework
Leichao Cui, Xiuxian Li, Min Meng, Xiaoyu Mo
TL;DR
MMFusion tackles LiDAR sparsity by fusing LiDAR and image features through decoupled streams and a learnable fusion module. It introduces the Voxel Local Perception Module to preserve local voxel information and the Multi-modal Feature Fusion Module to align cross-modal features via attention, enabling adaptive fusion. On KITTI, MMFusion surpasses state-of-the-art baselines, particularly boosting detection of small or occluded objects like cyclists and pedestrians, and demonstrates robustness and generalization. The framework is designed to be modular and extensible to other modalities and tasks in autonomous driving.
Abstract
LiDAR point clouds have become the most common data source in autonomous driving. However, because point clouds are sparse, accurate and reliable detection cannot be achieved in some scenarios. Images are receiving increasing attention because they complement point clouds. Despite some success, existing fusion methods either perform hard fusion or do not fuse in a direct manner. In this paper, we propose MMFusion, a generic 3D detection framework that uses multi-modal features. The framework aims to achieve accurate fusion between LiDAR and images to improve 3D detection in complex scenes. It consists of two separate streams, a LiDAR stream and a camera stream, each of which is compatible with any single-modal feature extraction network. The Voxel Local Perception Module in the LiDAR stream enhances local feature representation, and the Multi-modal Feature Fusion Module then selectively combines the feature outputs of the two streams to achieve better fusion. Extensive experiments show that our framework not only outperforms existing baselines but also improves their detection performance, especially for cyclists and pedestrians on the KITTI benchmark, with strong robustness and generalization capabilities. We hope this work will stimulate more research into multi-modal fusion for autonomous driving tasks.
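To make the idea of selectively combining features from two streams concrete, the following is a minimal NumPy sketch of one generic form of attention-weighted fusion. It is not the Multi-modal Feature Fusion Module itself, whose internals are not described here. The function name `attention_fuse`, the projection matrix `w`, and the assumption that both streams already yield aligned features of shape (N, C) are all hypothetical choices made for illustration.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention_fuse(lidar_feat, cam_feat, w):
    """Fuse two aligned (N, C) feature arrays with per-location weights.

    Hypothetical illustration only: `w` is an assumed (2C, 2) projection
    that scores each modality at each of the N locations.
    """
    joint = np.concatenate([lidar_feat, cam_feat], axis=1)  # (N, 2C)
    scores = joint @ w                                      # (N, 2)
    alpha = softmax(scores, axis=1)                         # each row sums to 1
    # Weighted sum of the two modalities at every location.
    fused = alpha[:, :1] * lidar_feat + alpha[:, 1:] * cam_feat
    return fused, alpha


# Toy usage with random features standing in for the two streams.
rng = np.random.default_rng(0)
lidar = rng.normal(size=(4, 8))   # assumed LiDAR-stream features
cam = rng.normal(size=(4, 8))     # assumed camera-stream features
w = rng.normal(size=(16, 2))      # assumed projection; learned in practice
fused, alpha = attention_fuse(lidar, cam, w)
```

In this sketch the weights in `alpha` let each location lean on whichever modality scores higher, in contrast to a fixed concatenation or sum. With an all-zero `w`, both modalities receive equal weight and the result reduces to a plain average.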
