GraphAlign: Enhancing Accurate Feature Alignment by Graph matching for Multi-Modal 3D Object Detection

Ziying Song, Haiyue Wei, Lin Bai, Lei Yang, Caiyan Jia

TL;DR

GraphAlign tackles cross-modal misalignment in multi-modal 3D object detection with graph-based feature alignment followed by a self-attention refinement stage. It builds nearest-neighbor graphs over LiDAR point cloud features, projects them onto the image features through the calibration matrix, and fuses each point cloud feature with its K neighboring image features (one-to-many fusion). A Self-Attention Feature Alignment (SAFA) module then reweights the most significant relations. Experiments on KITTI and nuScenes show state-of-the-art or competitive performance, with notable gains on long-range small objects and less computation than full cross-modal attention. The approach offers a practical, scalable solution for robust cross-modal fusion in autonomous driving.

Abstract

LiDAR and cameras are complementary sensors for 3D object detection in autonomous driving. However, it is challenging to explore the unnatural interaction between point clouds and images, and the critical factor is how to conduct feature alignment of heterogeneous modalities. Currently, many methods achieve feature alignment by projection calibration only, without considering the coordinate conversion errors between sensors, leading to sub-optimal performance. In this paper, we present GraphAlign, a more accurate feature alignment strategy for 3D object detection by graph matching. Specifically, we fuse image features from a semantic segmentation encoder in the image branch and point cloud features from a 3D Sparse CNN in the LiDAR branch. To save computation, we construct the nearest neighbor relationship by calculating Euclidean distance within the subspaces into which the point cloud features are divided. Through the projection calibration between the image and point cloud, we project the nearest neighbors of point cloud features onto the image features. Then, by matching a single point cloud feature to multiple neighboring image features, we search for a more appropriate feature alignment. In addition, we provide a self-attention module to enhance the weights of significant relations to fine-tune the feature alignment between heterogeneous modalities. Extensive experiments on the nuScenes benchmark demonstrate the effectiveness and efficiency of our GraphAlign.
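The two geometric steps named in the abstract (nearest-neighbor search restricted to subspaces, and projection of 3D positions onto the image) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the uniform grid used to define subspaces, the `cell_size` parameter, and the 3x4 pinhole matrix `P` are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def project_to_image(point, P):
    """Project a 3D point (x, y, z) to pixel coordinates (u, v).

    P is a 3x4 projection (calibration) matrix given as nested lists.
    The point is assumed to be expressed in the frame that P expects.
    """
    x, y, z = point
    hom = (x, y, z, 1.0)  # homogeneous coordinates
    u, v, w = (sum(P[r][c] * hom[c] for c in range(4)) for r in range(3))
    return (u / w, v / w)


def knn_within_subspaces(points, k, cell_size):
    """Return, for each point, the indices of its k nearest neighbours.

    Neighbours are searched only among points that fall into the same
    grid cell (the "subspace"; a uniform grid is an assumption here),
    which avoids comparing every point with every other point.
    """
    cells = defaultdict(list)
    for idx, p in enumerate(points):
        key = tuple(math.floor(c / cell_size) for c in p)
        cells[key].append(idx)

    neighbours = [[] for _ in points]
    for members in cells.values():
        for i in members:
            others = [j for j in members if j != i]
            others.sort(key=lambda j: math.dist(points[i], points[j]))
            neighbours[i] = others[:k]
    return neighbours


# Hypothetical toy data: three points share one cell, one point is isolated.
points = [(0.1, 0.1, 0.1), (0.2, 0.1, 0.1), (0.9, 0.9, 0.9), (5.0, 5.0, 5.0)]
print(knn_within_subspaces(points, k=2, cell_size=1.0))
# → [[1, 2], [0, 2], [1, 0], []]

# Hypothetical pinhole matrix: focal length 100, principal point (50, 40).
P = [[100.0, 0.0, 50.0, 0.0],
     [0.0, 100.0, 40.0, 0.0],
     [0.0, 0.0, 1.0, 0.0]]
print(project_to_image((1.0, 2.0, 10.0), P))
# → (60.0, 60.0)
```

Restricting the search to one cell trades some recall (a true nearest neighbor may sit just across a cell boundary) for a much smaller number of distance computations, which is the kind of saving the abstract refers to.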

Paper Structure (23 sections, 3 equations, 4 figures, 9 tables, 2 algorithms)

Figures (4)

  • Figure 1: Comparison of feature alignment strategies: (a) Projection-based quickly establishes the relationship between modal features but may suffer from misalignment due to sensor error. (b) Attention-based preserves semantic information by learning alignment but has a high computational cost. (c) Our proposed GraphAlign uses graph-based feature alignment to match more plausible alignments between modalities with reduced computation and improved accuracy.
  • Figure 2: The framework of GraphAlign. It consists of the Graph Feature Alignment (GFA) module and the Self-Attention Feature Alignment (SAFA) module. The GFA module takes image and point cloud features as input, uses the projection calibration matrix to convert 3D positions to 2D pixel positions, constructs local neighborhood information to find nearest neighbors, and combines image and point cloud features. The SAFA module models the contextual relationships among the K nearest neighbors through a self-attention mechanism, thereby enhancing the importance of fused features and ultimately selecting the most representative features.
  • Figure 3: GFA process flow. (a) Sensor accuracy errors lead to misalignment. (b) GFA builds neighbor relationships through graphs in the point cloud features. (c) The point cloud features are projected onto the image features to obtain the K nearest neighbors among the image features. (d) One-to-many fusion is performed: each individual point cloud feature is fused with its K neighboring image features to achieve a better alignment.
  • Figure 4: SAFA module flow. The head and max modules are simplified here, and the SAFA module aims to enhance the expression of fusion features by improving the global context information between the K neighborhoods.
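
The one-to-many fusion in Figure 3(d) and the attention-then-select flow in Figures 2 and 4 can be sketched roughly as follows. This is an illustrative sketch only, not the paper's implementation: fusing by element-wise sum, using a single attention head with identity query/key/value projections, and selecting features with a per-channel max are all assumptions, as are the function names.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def one_to_many_fuse(point_feat, image_feats):
    """Fuse one point cloud feature with each of its K image features.

    Element-wise sum is an assumption; the source only says the
    features are fused. Returns K fused feature vectors.
    """
    return [[p + q for p, q in zip(point_feat, img)] for img in image_feats]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention over K tokens.

    Identity Q/K/V projections are used (assumption), so each output is
    a softmax-weighted average of the input tokens.
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, tokens)) for c in range(d)])
    return out


def max_over_neighbours(tokens):
    """Keep the strongest response per channel across the K tokens."""
    return [max(t[c] for t in tokens) for c in range(len(tokens[0]))]


# Hypothetical toy features: one 2-channel point feature, K = 3 image features.
point_feat = [1.0, 0.0]
image_feats = [[0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]

fused = one_to_many_fuse(point_feat, image_feats)   # K fused vectors
attended = self_attention(fused)                    # context among the K neighbours
aligned = max_over_neighbours(attended)             # one vector per point
print(fused)
print(aligned)
```

Because attention here operates only over the K neighbors of each point rather than over all image locations, its cost grows with K instead of with image size, which matches the reduced-computation argument in Figure 1.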