GAFusion: Adaptive Fusing LiDAR and Camera with Multiple Guidance for 3D Object Detection

Xiaotian Li, Baojie Fan, Jiandong Tian, Huijie Fan

TL;DR

This work proposes GAFusion, a novel multi-modality 3D object detection method with LiDAR-guided global interaction and adaptive fusion. It introduces sparse depth guidance and LiDAR occupancy guidance to generate 3D features with sufficient depth information.

Abstract

Recent years have witnessed remarkable progress in multi-modality 3D object detection methods based on the Bird's-Eye-View (BEV) perspective. However, most of them overlook the complementary interaction and guidance between LiDAR and camera. In this work, we propose a novel multi-modality 3D object detection method, named GAFusion, with LiDAR-guided global interaction and adaptive fusion. Specifically, we introduce sparse depth guidance (SDG) and LiDAR occupancy guidance (LOG) to generate 3D features with sufficient depth information. A LiDAR-guided adaptive fusion transformer (LGAFT) is then developed to adaptively enhance the interaction between the BEV features of the two modalities from a global perspective. Meanwhile, additional downsampling with sparse height compression and a multi-scale dual-path transformer (MSDPT) are designed to enlarge the receptive fields of the features of each modality. Finally, a temporal fusion module is introduced to aggregate features from previous frames. GAFusion achieves state-of-the-art 3D object detection results with 73.6% mAP and 74.9% NDS on the nuScenes test set.

Paper Structure

This paper contains 18 sections, 5 equations, 8 figures, 6 tables.

Figures (8)

  • Figure 1: Comparison between BEVFusion and the proposed GAFusion. (a) In BEVFusion, the camera stream and the LiDAR stream generate BEV features separately, and these features are then concatenated. (b) In GAFusion, the camera BEV features are generated under multiple guidance from the LiDAR stream, and the receptive fields are enlarged by MSDPT. The BEV features are fused by LGAFT. “VT” denotes the view transformer.
  • Figure 2: The overall architecture of GAFusion. The multi-view images and point clouds are fed into the corresponding backbone networks to obtain multi-scale LiDAR features and camera features. For LiDAR guidance, we propose sparse depth guidance (SDG) and LiDAR occupancy guidance (LOG), which guide the 2D camera features using the raw point clouds and the LiDAR BEV features, respectively. In addition, we use a multi-scale dual-path transformer (MSDPT) to enlarge the receptive fields. Then, the LiDAR-guided adaptive fusion transformer (LGAFT) further fuses the BEV features of the two modalities. A temporal fusion module aggregates the previous frame’s BEV features, and the resulting BEV features are finally fed into an encoder and a detection head.
  • Figure 3: Additional downsampling and sparse height compression. This operation enlarges the receptive fields of the features and reduces the computational cost.
  • Figure 4: The architecture of sparse depth guidance (SDG) and LiDAR occupancy guidance (LOG). These two modules guide the 2D camera features to generate 3D features that contain sufficient semantic information and accurate depth information.
  • Figure 5: The schema of dual-path transformer (DPT), which effectively aggregates semantic information and expands the receptive fields of the camera features.
  • ...and 3 more figures
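
The abstract and the Figure 2 caption state that SDG draws on the raw point clouds to supply depth information to the camera branch, but they do not spell out how. As a rough illustration only, the sketch below shows one common way a sparse depth map can be built from LiDAR points: project each point through a pinhole camera model and keep the nearest depth per pixel. The function name `sparse_depth_map`, the pinhole parameters (`fx`, `fy`, `cx`, `cy`), and the assumption that the points are already in the camera frame are all hypothetical and are not taken from the paper.

```python
import math
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float, float]  # (x, y, z) in the camera frame, z pointing forward


def sparse_depth_map(
    points_cam: Sequence[Point],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
    height: int,
    width: int,
) -> List[List[Optional[float]]]:
    """Hypothetical helper: project points with a pinhole model.

    Returns a height x width grid. Pixels hit by at least one point hold the
    smallest depth among those points; all other pixels stay None, which is
    what makes the map sparse.
    """
    depth: List[List[Optional[float]]] = [[None] * width for _ in range(height)]
    for x, y, z in points_cam:
        if z <= 0.0:
            continue  # point is behind the camera, so it cannot be imaged
        u = math.floor(fx * x / z + cx)  # pixel column
        v = math.floor(fy * y / z + cy)  # pixel row
        if 0 <= u < width and 0 <= v < height:
            current = depth[v][u]
            if current is None or z < current:
                depth[v][u] = z  # keep the nearest return for this pixel
    return depth


# Toy input: two points on the optical axis, one behind the camera,
# and one far outside the field of view.
demo = sparse_depth_map(
    [(0.0, 0.0, 2.0), (0.0, 0.0, 5.0), (1.0, 0.0, -1.0), (100.0, 0.0, 1.0)],
    fx=1.0, fy=1.0, cx=1.0, cy=1.0, height=3, width=3,
)
```

In the toy call, only the centre pixel receives a depth value, and it keeps the nearer of the two points that land on it. A real pipeline would also apply the LiDAR-to-camera extrinsic transform before projection, which this sketch leaves out.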