A Generalized Multi-Modal Fusion Detection Framework

Leichao Cui, Xiuxian Li, Min Meng, Xiaoyu Mo

TL;DR

MMFusion tackles LiDAR sparsity by fusing LiDAR and image features through decoupled streams and a learnable fusion module. It introduces the Voxel Local Perception Module to preserve local voxel information and the Multi-modal Feature Fusion Module to align cross-modal features via attention, enabling adaptive fusion. On KITTI, MMFusion surpasses state-of-the-art baselines, particularly boosting detection of small or occluded objects like cyclists and pedestrians, and demonstrates robustness and generalization. The framework is designed to be modular and extensible to other modalities and tasks in autonomous driving.

Abstract

LiDAR point clouds have become the most common data source in autonomous driving. However, because point clouds are sparse, accurate and reliable detection cannot be achieved in some scenarios. Images complement point clouds and are therefore receiving increasing attention. Despite some success, existing fusion methods either perform hard fusion or do not fuse in a direct manner. In this paper, we propose MMFusion, a generic 3D detection framework that uses multi-modal features. The framework aims to achieve accurate fusion between LiDAR and images to improve 3D detection in complex scenes. It consists of two separate streams, the LiDAR stream and the camera stream, each of which is compatible with any single-modal feature extraction network. The Voxel Local Perception Module in the LiDAR stream enhances local feature representation, and the Multi-modal Feature Fusion Module then selectively combines the feature outputs of the two streams to achieve better fusion. Extensive experiments show that our framework not only outperforms existing methods but also improves their detection performance, especially for cyclists and pedestrians on the KITTI benchmark, with strong robustness and generalization capabilities. We hope our work will stimulate more research into multi-modal fusion for autonomous driving tasks.

Paper Structure

This paper contains 17 sections, 12 equations, 5 figures, and 4 tables.

Figures (5)

  • Figure 1: Multi-sensor fusion methods are divided into three types: result-level, point-level and feature-level fusion. (a) Result-level fusion fuses the output of individual detectors. (b) Point-level fusion projects the point clouds onto the image and acquires the corresponding features for fusion. (c) Feature-level fusion starts by acquiring features from different modalities and then fusing the features.
  • Figure 2: MMFusion consists of separate point cloud and image data streams, a multi-modal fusion module and a 3D object detection head. (a) The LiDAR Stream extracts features from the original point cloud. (b) The Image Stream extracts features from the RGB images. (c) The Multi-modal Data Fusion Module fuses the multi-modal features from the two streams. (d) The fused features support different tasks (e.g. 3D object detection) through task-specific heads.
  • Figure 3: The Voxel Local Perception Module consists of a Point Attention Module and a Dynamic Weights Module. (a) The Point Attention Module adaptively acquires the features of other points within the same voxel grid. (b) The Dynamic Weights Module dynamically computes weights for the points in the same voxel and uses them to obtain the voxel features.
  • Figure 4: The structure of the Multi-modal Feature Fusion Module. (a) It first compresses the features of the different modalities to a uniform size. (b) The features are then mapped to the same feature space. (c) Finally, an element-wise sum fuses the features in that shared space.
  • Figure 5: Visualization and detection results on the KITTI dataset. Green, yellow and blue 3D bounding boxes represent the detection results for cars, cyclists and pedestrians, respectively.
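
The three steps in the Figure 4 caption (compress to a uniform size, map to a shared feature space, element-wise sum) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the caption does not say how compression or mapping is done, so average pooling and a per-cell linear projection are assumptions made here for illustration, and the attention-based alignment mentioned in the TL;DR is not modeled. All function names and the nested-list layout (height x width x channels) are hypothetical.

```python
def avg_pool(fmap, out_h, out_w):
    """Step (a), assumed: average-pool an H x W x C nested list down to
    out_h x out_w x C. H and W must be multiples of out_h and out_w."""
    h, w, c = len(fmap), len(fmap[0]), len(fmap[0][0])
    step_h, step_w = h // out_h, w // out_w
    pooled = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Collect every cell that falls into output cell (i, j).
            cells = [fmap[y][x]
                     for y in range(i * step_h, (i + 1) * step_h)
                     for x in range(j * step_w, (j + 1) * step_w)]
            row.append([sum(cell[k] for cell in cells) / len(cells)
                        for k in range(c)])
        pooled.append(row)
    return pooled


def project(fmap, weight):
    """Step (b), assumed: map each cell's channel vector into a shared
    space with a linear projection. weight has shape C_in x C_out."""
    c_out = len(weight[0])
    return [[[sum(vec[i] * weight[i][o] for i in range(len(vec)))
              for o in range(c_out)]
             for vec in row]
            for row in fmap]


def fuse(lidar, image, w_lidar, w_image, out_h, out_w):
    """Steps (a)-(c): resize both maps, project both into the same
    feature space, then fuse with an element-wise sum."""
    a = project(avg_pool(lidar, out_h, out_w), w_lidar)
    b = project(avg_pool(image, out_h, out_w), w_image)
    return [[[x + y for x, y in zip(vec_a, vec_b)]
             for vec_a, vec_b in zip(row_a, row_b)]
            for row_a, row_b in zip(a, b)]
```

Because both maps are projected to the same number of output channels before the sum, the two modalities may start with different channel counts and spatial sizes. In a trained model the projection weights would be learned; here they are plain arguments.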