Fully Sparse Fusion for 3D Object Detection

Yingyan Li; Lue Fan; Yang Liu; Zehao Huang; Yuntao Chen; Naiyan Wang; Zhaoxiang Zhang

TL;DR

This work tackles long-range 3D detection by removing dense BEV feature maps and introducing Fully Sparse Fusion (FSF), a fully sparse, instance-level multi-modal detector that combines LiDAR and image information. FSF integrates 2D instance segmentation with 3D instance segmentation via Bi-modal Instance Generation and Bi-modal Instance-based Prediction, augmented by a two-stage assignment strategy to robustly label mixed-modality instances. The approach delivers state-of-the-art results on nuScenes, Waymo Open, and Argoverse 2, with particularly strong gains in long-range and small-object categories and a favorable latency/memory profile. By leveraging instance-level fusion and avoiding dense BEV maps, FSF demonstrates practical benefits for scalable, real-time autonomous driving perception systems.

Abstract

Currently prevalent multimodal 3D detection methods are built upon LiDAR-based detectors that usually use dense Bird's-Eye-View (BEV) feature maps. However, the cost of such BEV feature maps is quadratic in the detection range, making them unsuitable for long-range detection. Fully sparse architectures are gaining attention because they are highly efficient in long-range perception. In this paper, we study how to effectively leverage the image modality in the emerging fully sparse architecture. In particular, using instance queries, our framework integrates well-studied 2D instance segmentation into the LiDAR side, in parallel with the 3D instance segmentation part of the fully sparse detector. This design yields a uniform query-based fusion framework on both the 2D and 3D sides while preserving the fully sparse characteristic. Extensive experiments show state-of-the-art results on the widely used nuScenes dataset and the long-range Argoverse 2 dataset. Notably, under the long-range LiDAR perception setting, the inference speed of the proposed method is 2.7$\times$ faster than that of other state-of-the-art multimodal 3D detection methods. Code will be released at \url{https://github.com/BraveGroup/FullySparseFusion}.
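To make the quadratic-cost claim concrete, the following minimal sketch counts the cells of a square BEV grid covering a given detection range. The function name `bev_cell_count` and the 0.2 m cell size are illustrative assumptions, not values taken from the paper.

```python
def bev_cell_count(detection_range_m: float, cell_size_m: float = 0.2) -> int:
    """Count cells in a square BEV grid covering [-R, R] x [-R, R].

    cell_size_m is a hypothetical resolution chosen for illustration only.
    """
    cells_per_side = int(round(2 * detection_range_m / cell_size_m))
    return cells_per_side ** 2


if __name__ == "__main__":
    # Doubling the range quadruples the number of dense BEV cells.
    for range_m in (50.0, 100.0, 200.0):
        print(f"range {range_m:5.0f} m: {bev_cell_count(range_m):,d} cells")
```

Under these assumptions, every doubling of the detection range multiplies the dense grid size by four, regardless of how many cells actually contain LiDAR points. A fully sparse design avoids allocating that grid.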

Paper Structure (55 sections, 6 equations, 5 figures, 14 tables, 1 algorithm)

Introduction
Related Work
Camera-based 3D detection
LiDAR-based 3D detection
Multi-modal 3D detection
Preliminary: Fully Sparse 3D Detector
Points-to-Instances
Instances-to-Boxes
Limitations of single-modal FSD
Methodology
Overall Architecture
Bi-modal Instance Generation
LiDAR Instance Generation
Camera Instance Generation
Bi-modal Instance-based Prediction
...and 40 more sections

Figures (5)

Figure 1: Comparison between dense fusion and sparse fusion. Dense fusion methods rely on dense BEV feature maps. In contrast, our sparse fusion framework fuses two modalities at the instance level, requiring no dense feature maps.
Figure 2: (a) 3D instance segmentation is prone to missing objects that contain few points. (b) It is hard for 3D segmentation to separate overlapping objects in a crowded scene. In contrast, 2D instance segmentation handles these cases more easily.
Figure 3: Overview of our framework, which consists of two parts: the Bi-modal Instance Generation module and the Bi-modal Instance-based Prediction module. The Bi-modal Instance Generation module generates instances from the camera and LiDAR modalities. The Bi-modal Instance-based Prediction module then aligns the shapes of the bi-modal instances and produces the final bounding boxes.
Figure 4: Motivation for the two-stage assignment. As (a) shows, noise points often prevent the center of a camera instance from falling inside its 3D GT box. However, as (b) shows, the same camera instance is easy to assign to the corresponding GT on the 2D image plane.
Figure 5: Qualitative comparison between FSD and FSF. In these scenes, some objects are partially occluded by wire netting, leaving very few LiDAR points on them. We mark the areas of interest with red circles. With the help of image information, FSF detects more objects than FSD.
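
The 2D-plane assignment described in the Figure 4 caption can be illustrated with a small sketch. The captions do not state the matching criterion, so the IoU rule, the 0.5 threshold, and the names `iou_2d` and `assign_on_image_plane` below are all assumptions made for illustration. This is not the authors' implementation.

```python
from typing import List, Optional, Tuple

# (x_min, y_min, x_max, y_max) in pixels
Box2D = Tuple[float, float, float, float]


def iou_2d(a: Box2D, b: Box2D) -> float:
    """Intersection-over-union of two axis-aligned 2D boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def assign_on_image_plane(
    instance_box: Box2D,
    projected_gt_boxes: List[Box2D],
    iou_threshold: float = 0.5,  # hypothetical threshold
) -> Optional[int]:
    """Return the index of the best-overlapping GT box, or None if no
    GT box reaches the threshold. GT boxes are assumed to be already
    projected onto the image plane."""
    best_idx: Optional[int] = None
    best_iou = iou_threshold
    for idx, gt_box in enumerate(projected_gt_boxes):
        iou = iou_2d(instance_box, gt_box)
        if iou >= best_iou:
            best_idx, best_iou = idx, iou
    return best_idx
```

Matching on the image plane uses only the 2D mask or box of the camera instance, so it is unaffected by background LiDAR points that shift the instance's 3D center.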
