InsFusion: Rethink Instance-level LiDAR-Camera Fusion for 3D Object Detection

Zhongyu Xia, Hansong Yang, Yongtao Wang

TL;DR

InsFusion tackles noise and error accumulation in multi-view LiDAR-camera fusion for 3D object detection by introducing an instance-level, query-based refinement that leverages proposals from raw and fused features. The method extracts $K$ proposals from each modality, aligns them into a shared space using modality-specific transformers, and refines them via a deformable transformer that attends to raw image features, raw LiDAR BEV features, and fused BEV features. This multi-source querying anchors refinement to low-noise inputs, improving state-of-the-art baselines on nuScenes with minimal fine-tuning. The approach is broadly compatible with BEV-based fusion architectures and offers practical gains with modest computational overhead, making it attractive for autonomous driving perception stacks.
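The extract, align, and refine flow described above can be illustrated with a small sketch. This is not the authors' implementation. It assumes a hypothetical proposal rule (the feature vectors with the largest L2 norm, standing in for a learned proposal head), skips the modality-specific alignment transformers by treating all features as already living in one shared space, and substitutes plain dense dot-product attention for the deformable transformer. The names `top_k`, `attend`, `refine`, and the values of `K` and `D` are illustrative only.

```python
import math
import random

K = 2  # proposals per source (hypothetical value)
D = 4  # feature dimension (hypothetical value)


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(features, k):
    """Pick the k feature vectors with the largest L2 norm.

    Assumption: this stands in for a learned proposal head.
    """
    return sorted(features, key=lambda f: sum(x * x for x in f), reverse=True)[:k]


def attend(query, features):
    """Single-head dot-product attention of one query over a list of vectors.

    Assumption: dense attention replaces the deformable attention in the text.
    """
    scale = math.sqrt(len(query))
    logits = [sum(q * f for q, f in zip(query, feat)) / scale for feat in features]
    weights = softmax(logits)
    return [sum(w * feat[d] for w, feat in zip(weights, features))
            for d in range(len(query))]


def refine(queries, sources):
    """Add to each query the mean of its attention readouts from every source."""
    refined = []
    for q in queries:
        reads = [attend(q, feats) for feats in sources.values()]
        update = [sum(r[d] for r in reads) / len(reads) for d in range(len(q))]
        refined.append([qd + ud for qd, ud in zip(q, update)])
    return refined


# Toy stand-ins for raw image features, raw LiDAR BEV features, and fused BEV features.
random.seed(0)
sources = {
    name: [[random.gauss(0, 1) for _ in range(D)] for _ in range(6)]
    for name in ("image", "lidar_bev", "fused_bev")
}

# K proposals from each source become the instance queries.
queries = [q for feats in sources.values() for q in top_k(feats, K)]
refined = refine(queries, sources)
print(len(refined), len(refined[0]))  # 6 4
```

Every query here reads from all three sources, so a proposal that originated in the fused BEV features is still refined against the raw image and raw LiDAR features. That mirrors the idea of anchoring refinement to inputs that have not yet accumulated errors from view transformation and fusion.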

Abstract

Three-dimensional object detection from multi-view cameras and LiDAR is a crucial component of autonomous driving and smart transportation. However, noise and error gradually accumulate through basic feature extraction, perspective transformation, and feature fusion. To address this issue, we propose InsFusion, which extracts proposals from both raw and fused features and uses these proposals to query the raw features through attention mechanisms, thereby mitigating the impact of accumulated errors. Experiments on the nuScenes dataset demonstrate that InsFusion is compatible with various advanced baseline methods and delivers new state-of-the-art performance for 3D object detection.

Paper Structure

This paper contains 10 sections, 1 equation, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Comparison between the existing paradigm and our paradigm.
  • Figure 2: Overview of the InsFusion framework. The framework extracts proposals from raw camera features, LiDAR features, as well as fused BEV features, and then aligns and refines all proposals to predict 3D bounding boxes.