DecoratingFusion: A LiDAR-Camera Fusion Network with the Combination of Point-level and Feature-level Fusion

Zixuan Yin, Han Sun, Ningzhong Liu, Huiyu Zhou, Jiaquan Shen

TL;DR

DecoratingFusion addresses interpretability and alignment challenges in LiDAR–camera fusion for 3D detection by integrating hard calibration-based associations into a two-stage framework that decorates points with image features and initializes object queries via a center heatmap. The approach combines point-level decoration with cross-attention-based feature fusion and embeds class information into queries, enabling end-to-end training and BEV fusion. Extensive experiments on KITTI and Waymo show state-of-the-art performance, particularly for small objects and pedestrians, demonstrating the practical benefits for autonomous driving. The work advances multi-modal fusion by uniting hard point–image associations with soft feature fusion, improving convergence, reliability, and detection accuracy.
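The point-level decoration step can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `decorate_points`, the single 3x4 projection matrix `P`, the assumption that points are already expressed in the camera frame, and the nearest-pixel lookup are all illustrative choices made here.

```python
import numpy as np

def decorate_points(points, feat_map, P, stride=1):
    """Append image features to each point using a hard, calibration-based lookup.

    points   : (N, 3+) array; the first three columns are x, y, z, assumed to be
               in the camera frame already (assumption for this sketch).
    feat_map : (C, H, W) image feature map from a 2D CNN.
    P        : (3, 4) projection matrix from the calibration file.
    stride   : downsampling factor between the image and feat_map.
    """
    n = points.shape[0]
    homo = np.hstack([points[:, :3], np.ones((n, 1))])   # (N, 4) homogeneous coords
    uvw = homo @ P.T                                      # (N, 3) projected coords
    depth = uvw[:, 2]
    in_front = depth > 1e-6                               # drop points behind the camera
    safe_depth = np.where(in_front, depth, 1.0)           # avoid division by zero
    u = uvw[:, 0] / safe_depth / stride
    v = uvw[:, 1] / safe_depth / stride

    C, H, W = feat_map.shape
    ui = np.floor(u).astype(int)
    vi = np.floor(v).astype(int)
    valid = in_front & (ui >= 0) & (ui < W) & (vi >= 0) & (vi < H)

    # Points that fall outside the image keep an all-zero image feature.
    img_feats = np.zeros((n, C), dtype=feat_map.dtype)
    img_feats[valid] = feat_map[:, vi[valid], ui[valid]].T
    return np.hstack([points, img_feats]), valid
```

Because the pixel each point reads from is fixed by the calibration rather than learned, the association can be inspected directly, which is the interpretability argument made above.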

Abstract

LiDARs and cameras play essential roles in autonomous driving, offering complementary information for 3D detection. State-of-the-art fusion methods integrate the two modalities at the feature level, but they mostly rely on a learned soft association between point clouds and images, which lacks interpretability and neglects the hard association between them. In this paper, we combine feature-level fusion with point-level fusion, using the hard association established by the calibration matrices to guide the generation of object queries. Specifically, in the early fusion stage, we use the 2D CNN features of images to decorate the point cloud data and employ two independent sparse convolutions to extract the decorated point cloud features. In the mid-level fusion stage, we initialize the queries with a center heatmap and embed the predicted class labels into the queries as auxiliary information, bringing the initial positions closer to the actual centers of the targets. Extensive experiments on two popular datasets, KITTI and Waymo, demonstrate the superiority of DecoratingFusion.
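The heatmap-based query initialization described in the abstract might look roughly like the following NumPy sketch. It is an illustration under stated assumptions, not the paper's code: the function name `init_queries_from_heatmap`, the 3x3 per-class local-maximum filter, and the one-hot class embedding concatenated to the BEV feature are all hypothetical choices.

```python
import numpy as np

def init_queries_from_heatmap(heatmap, bev_feats, num_queries):
    """Pick the strongest heatmap peaks and build class-aware initial queries.

    heatmap     : (K, H, W) per-class center scores on the BEV grid.
    bev_feats   : (C, H, W) BEV feature map.
    num_queries : maximum number of queries to return.
    """
    heatmap = np.asarray(heatmap, dtype=float)
    K, H, W = heatmap.shape

    # Keep only cells that are the maximum of their 3x3 neighbourhood (per class).
    padded = np.pad(heatmap, ((0, 0), (1, 1), (1, 1)), constant_values=-np.inf)
    neighbourhood_max = np.full_like(heatmap, -np.inf)
    for dy in range(3):
        for dx in range(3):
            window = padded[:, dy:dy + H, dx:dx + W]
            neighbourhood_max = np.maximum(neighbourhood_max, window)
    peaks = np.where(heatmap >= neighbourhood_max, heatmap, -np.inf)

    # Take the top-scoring peaks across all classes and grid cells.
    flat = peaks.reshape(-1)
    order = np.argsort(-flat, kind="stable")[:num_queries]
    order = order[np.isfinite(flat[order])]        # discard non-peak cells
    cls, ys, xs = np.unravel_index(order, (K, H, W))
    scores = flat[order]

    # Query = BEV feature at the peak, concatenated with a one-hot class vector
    # (a simple stand-in for a learned class embedding).
    one_hot = np.eye(K, dtype=bev_feats.dtype)[cls]
    queries = np.hstack([bev_feats[:, ys, xs].T, one_hot])
    positions = np.stack([xs, ys], axis=1)         # (x, y) grid coordinates
    return queries, positions, cls, scores
```

Starting each query at a predicted center, with its class attached, gives the decoder a position that is already near a likely object instead of a random or learned-but-static anchor.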

Paper Structure (18 sections, 1 equation, 2 figures, 4 tables)

Figures (2)

  • Figure 1: An overview of DecoratingFusion framework.
  • Figure 2: The two independent sparse convolutions used to extract LiDAR and image features.