PathFusion: Path-consistent Lidar-Camera Deep Feature Fusion

Lemeng Wu, Dilin Wang, Meng Li, Yunyang Xiong, Raghuraman Krishnamoorthi, Qiang Liu, Vikas Chandra

TL;DR

This work tackles semantic misalignment in deep LiDAR-camera fusion for 3D detection. It introduces PathFusion, a path-consistency regularization applied at multiple fusion stages: starting from the same 2D input, the output of the 2D pathway (2D backbone plus fusion) is encouraged to agree with the output of the 3D pathway, with gradients stopped on the 3D branch. Applied to the Focals Conv baseline, PathFusion improves KITTI $\text{AP}_{\text{3D}}$ (R11) by about 0.6% on the moderate level and raises mAP by over 1.6% on the nuScenes test split, along with a higher NDS, indicating that regularizing feature alignment makes deeper fusion beneficial rather than harmful. The method is lightweight to integrate and can be applied to existing fusion architectures, offering a practical route to improve multi-modal 3D detection without heavy cross-attention modules.

Abstract

Fusing 3D LiDAR features with 2D camera features is a promising technique for enhancing the accuracy of 3D detection, thanks to their complementary physical properties. While most existing methods fuse camera features with raw LiDAR point clouds or shallow 3D features, directly combining 2D and 3D features in deeper layers is observed to decrease accuracy because of feature misalignment. The misalignment, which stems from aggregating features learned over large receptive fields, becomes increasingly severe in deeper layers. In this paper, we propose PathFusion to enable semantically coherent deep LiDAR-camera feature fusion. PathFusion introduces a path consistency loss at multiple stages within the network, encouraging the 2D backbone and its fusion path to transform 2D features in a way that aligns semantically with the transformation of the 3D backbone. This ensures semantic consistency between 2D and 3D features, even in deeper layers, and makes fuller use of the network's learning capacity. We apply PathFusion to a prior-art fusion baseline, Focals Conv, and observe an improvement of over 1.6% in mAP on the nuScenes test split, consistently with and without test-time data augmentation. PathFusion also improves KITTI $\text{AP}_{\text{3D}}$ (R11) by about 0.6% on the moderate level.
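
The path consistency idea described in the abstract can be illustrated with a small sketch. This is one plausible reading, not the authors' implementation: the function names, the cosine distance, the linear "stages", and the identity "lifting" are all assumptions chosen for illustration. Starting from the same 2D feature, one route lifts to 3D and then applies the 3D stage, the other applies the 2D stage and then lifts, and the loss penalizes disagreement between the two results.

```python
import numpy as np

def cosine_distance(a, b, eps=1e-8):
    """Mean of (1 - cosine similarity) over the rows of two (N, C) arrays."""
    num = np.sum(a * b, axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + eps
    return float(np.mean(1.0 - num / den))

def path_consistency_loss(x2d, stage_2d, stage_3d, lift_in, lift_out):
    """Compare two routes from the same 2D feature x2d into the next 3D stage."""
    # Route A (3D branch): lift first, then apply the 3D stage.
    # In an autodiff framework this target would be wrapped in a stop-gradient
    # so that only the 2D branch and its fusion path are updated.
    target = stage_3d(lift_in(x2d))
    # Route B (2D branch): apply the 2D stage first, then lift.
    pred = lift_out(stage_2d(x2d))
    return cosine_distance(pred, target)

# Toy stand-ins (hypothetical): linear maps instead of real backbone stages,
# and an identity function instead of a real 2D-to-3D projection.
rng = np.random.default_rng(0)
x2d = rng.normal(size=(16, 8))
W2d = rng.normal(size=(8, 8))
W3d = rng.normal(size=(8, 8))
identity = lambda f: f

loss_mismatched = path_consistency_loss(
    x2d, lambda f: f @ W2d, lambda f: f @ W3d, identity, identity)
loss_matched = path_consistency_loss(
    x2d, lambda f: f @ W3d, lambda f: f @ W3d, identity, identity)
print(loss_mismatched, loss_matched)
```

When the 2D stage transforms features the same way as the 3D stage, the two routes coincide and the loss is close to zero. When they differ, the loss is positive, which is the signal a regularizer of this kind would use to pull the 2D pathway toward the 3D one.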

Paper Structure (32 sections, 5 equations, 4 figures, 7 tables)

Figures (4)

  • Figure 1: Overview of three different strategies to fuse the camera and LiDAR features: (a) Shallow fusion accurately fuses the 2D feature with the shallow 3D feature; (b) Deep fusion involves projecting the 2D features into the 3D feature space; (c) Our method introduces the path consistency loss to mitigate the issue of feature misalignment.
  • Figure 2: A generic 3D detection network with 2D feature fusion at different stages.
  • Figure 3: (a) Illustration of performance degradation with naive deep feature fusion. Results are on the KITTI val split. The baseline setup without feature fusion achieves an $\text{AP}_{\text{3D}}$ (R11) of 84.93%. (b) Illustration of our path-consistent loss.
  • Figure 4: Illustration of lifting features from 2D to 3D at a deeper stage. The upsampling is commonly implemented with a feature pyramid network (Lin et al., 2017). The image is sourced from the KITTI dataset.