SparseLIF: High-Performance Sparse LiDAR-Camera Fusion for 3D Object Detection

Hongcheng Zhang, Liu Liang, Pengxin Zeng, Xiao Song, Zhe Wang

TL;DR

SparseLIF addresses the accuracy gap between sparse and dense multi-modality 3D detectors with three components: Perspective-Aware Query Generation (PAQG), which produces 3D queries informed by perspective priors; RoI-Aware Sampling (RIAS), which extracts RoI features from each modality using few reference points; and Uncertainty-Aware Fusion (UAF), which weights the modalities by learned uncertainty. Together, these components give a fully sparse LiDAR-camera detector that keeps latency low while reaching state-of-the-art accuracy on nuScenes at the time of submission, remains robust under sensor noise, and integrates temporal information. Ablations validate the contribution of each module. Overall, SparseLIF shows that a sparse detector with careful query generation, targeted feature sampling, and uncertainty-guided fusion can outperform both dense and other sparse methods, which matters for real-time autonomous driving systems.

Abstract

Sparse 3D detectors have received significant attention because the query-based paradigm offers low latency without explicit dense BEV feature construction. However, these detectors perform worse than their dense counterparts. In this paper, we find that the key to bridging the performance gap is to enhance the awareness of rich representations in the two modalities. Here, we present a high-performance fully sparse detector for end-to-end multi-modality 3D object detection. The detector, termed SparseLIF, contains three key designs: (1) Perspective-Aware Query Generation (PAQG), which generates high-quality 3D queries with perspective priors; (2) RoI-Aware Sampling (RIAS), which further refines the prior queries by sampling RoI features from each modality; and (3) Uncertainty-Aware Fusion (UAF), which quantifies the uncertainty of each sensor modality and adaptively conducts the final multi-modality fusion, making the detector robust to sensor noise. At the time of paper submission, SparseLIF achieves state-of-the-art performance on the nuScenes dataset, ranking 1st on both the validation set and the test benchmark and outperforming all state-of-the-art 3D object detectors by a notable margin.
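
The idea behind uncertainty-guided fusion can be illustrated with a small sketch. It is not the UAF formulation from the paper: the function name `fuse_by_uncertainty`, the scalar per-modality uncertainties, and the softmax weighting rule are all assumptions chosen only to show how a less certain modality can be down-weighted when two per-query feature vectors are combined.

```python
import math


def fuse_by_uncertainty(camera_feat, lidar_feat, camera_unc, lidar_unc):
    """Blend two per-query feature vectors, down-weighting the less certain one.

    The weights are a softmax over negative uncertainties, so the modality
    with the higher uncertainty contributes less. This weighting rule is an
    assumption made for illustration, not the paper's formulation.
    """
    if len(camera_feat) != len(lidar_feat):
        raise ValueError("feature vectors must have the same length")

    # Softmax over negative uncertainties, shifted for numerical stability.
    logits = [-camera_unc, -lidar_unc]
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    w_cam, w_lidar = exps[0] / total, exps[1] / total

    # Element-wise convex combination of the two feature vectors.
    fused = [w_cam * c + w_lidar * l for c, l in zip(camera_feat, lidar_feat)]
    return fused, (w_cam, w_lidar)


# Hypothetical case: the LiDAR branch is unreliable for this query,
# so it is assigned a much higher uncertainty than the camera branch.
fused, weights = fuse_by_uncertainty(
    [1.0, 0.0], [0.0, 1.0], camera_unc=0.1, lidar_unc=2.0
)
```

In this toy case the camera features dominate the fused vector because the camera uncertainty is lower; with equal uncertainties the two modalities are averaged.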

Paper Structure (20 sections, 10 equations, 4 figures, 5 tables)

Figures (4)

  • Figure 1: The overall architecture of SparseLIF, a fully sparse LiDAR-camera-based 3D object detector. The framework contains a camera backbone to process multi-view videos and a LiDAR backbone to encode raw point clouds. We then feed the image features into the Perspective-Aware Query Generation (PAQG) module to generate queries. The queries will interact with the camera and LiDAR features via the RoI-Aware Sampling (RIAS) module to extract complementary features for further refinement. Next, the Uncertainty-Aware Fusion (UAF) module quantifies the uncertainty of RoI features from two modalities and adaptively conducts final multi-modality fusion. The decoder repeats $L$ times.
  • Figure 2: Motivations and details of our proposed PAQG module. (a) 3D detectors struggle with low sensitivity when detecting distant and small objects. (b) 2D detectors demonstrate excellent pixel-wise perception capabilities on such objects. (c) The PAQG module uses coupled 2D and monocular-3D sub-networks to predict dense boxes under the supervision of a perspective loss. The top-ranked boxes are selected to propose high-quality queries, which then interact with camera features via a cross-attention module.
  • Figure 3: Visualizations of sensor noise in 3D object detection for autonomous driving. (a) Limited FOV: a LiDAR installed in a front-facing manner yields a limited FOV, e.g. $120^{\circ}$. (b) Object Failure: the reflection rate of some objects (e.g. the black car) is below the LiDAR's detection threshold, so no LiDAR points are returned from them. (c) Camera Occlusion: the camera module is usually vulnerable to occlusions (e.g. by dust).
  • Figure 4: Robustness visualizations under a limited LiDAR FOV of $120^{\circ}$. Predicted boxes are drawn in green and ground-truth boxes in red.
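
The limited-FOV condition in Figures 3(a) and 4 can be approximated by discarding LiDAR points outside a forward-facing sector. The sketch below is one simple way to do that and is not taken from the paper: the function name `limit_lidar_fov` is hypothetical, and the axis convention (+x forward, +y left) is an assumption that has to be matched to the coordinate frame of the actual point cloud.

```python
import math


def limit_lidar_fov(points, fov_deg=120.0):
    """Keep only points whose azimuth lies inside a forward-facing sector.

    Assumes each point is a tuple (x, y, z, ...) with +x pointing forward
    and +y pointing left. This axis convention is an assumption; adapt it
    to the sensor or ego frame of the dataset in use.
    """
    half_fov = math.radians(fov_deg) / 2.0
    kept = []
    for point in points:
        # Azimuth in the ground plane: 0 rad means straight ahead.
        azimuth = math.atan2(point[1], point[0])
        if abs(azimuth) <= half_fov:
            kept.append(point)
    return kept


# Four hypothetical points at different bearings around the vehicle.
points = [
    (10.0, 0.0, 0.0),   # straight ahead
    (5.0, 5.0, 0.0),    # 45 degrees to the left
    (0.0, 10.0, 0.0),   # 90 degrees to the left
    (-10.0, 0.0, 0.0),  # directly behind
]
front_only = limit_lidar_fov(points)
```

With the default 120-degree sector, only the first two points survive, since their bearings (0 and 45 degrees) fall within 60 degrees of the forward axis.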