EdgeRegNet: Edge Feature-based Multimodal Registration Network between Images and LiDAR Point Clouds

Yuanchao Yue, Hui Yuan, Qinglong Miao, Xiaolong Mao, Raouf Hamzaoui, Peter Eisert

TL;DR

The paper tackles cross-modal registration between 2D images and 3D LiDAR point clouds and proposes EdgeRegNet, an edge-feature-driven framework that works on 2D edge pixels and 3D edge points to retain information from the original data instead of downsampling it. Separate edge-focused feature extractors feed a stack of attention-based feature exchange blocks that align the features of the two modalities; an optimal matching layer then selects correspondences, and EPnP with RANSAC estimates the rigid transform in $\mathrm{SE}(3)$. The training loss combines field-of-view consistency, matchability, and per-pair likelihood terms. Experiments on KITTI Odometry and nuScenes report state-of-the-art accuracy and efficiency, supporting edge-based cross-modal registration as a robust option for autonomous driving and robotics.

Abstract

Cross-modal data registration has long been a critical task in computer vision, with extensive applications in autonomous driving and robotics. Accurate and robust registration methods are essential for aligning data from different modalities, forming the foundation for multimodal sensor data fusion and enhancing perception systems' accuracy and reliability. The registration task between 2D images captured by cameras and 3D point clouds captured by Light Detection and Ranging (LiDAR) sensors is usually treated as a visual pose estimation problem. High-dimensional feature similarities from different modalities are leveraged to identify pixel-point correspondences, followed by pose estimation techniques using least squares methods. However, existing approaches often resort to downsampling the original point cloud and image data due to computational constraints, inevitably leading to a loss in precision. Additionally, high-dimensional features extracted using different feature extractors from various modalities require specific techniques to mitigate cross-modal differences for effective matching. To address these challenges, we propose a method that uses edge information from the original point clouds and images for cross-modal registration. We retain crucial information from the original data by extracting edge points and pixels, enhancing registration accuracy while maintaining computational efficiency. The use of edge points and edge pixels allows us to introduce an attention-based feature exchange block to eliminate cross-modal disparities. Furthermore, we incorporate an optimal matching layer to improve correspondence identification. We validate the accuracy of our method on the KITTI and nuScenes datasets, demonstrating its state-of-the-art performance.
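The least-squares pose estimation step mentioned in the abstract is commonly written as a reprojection-error minimization. The following is a sketch of that standard objective, not necessarily the paper's exact formulation. Here $\mathbf{T}$ is the rigid transform from the LiDAR frame to the camera frame and $\mathbf{K}$ is the camera intrinsic matrix; the symbols $\tilde{\mathbf{p}}_i$ (a matched 3D point in homogeneous coordinates), $\mathbf{u}_i$ (its matched pixel), and $\pi$ (perspective division) are introduced here for illustration only.

$$
\hat{\mathbf{T}} = \arg\min_{\mathbf{T} \in \mathrm{SE}(3)} \sum_{i} \left\| \pi\!\left(\mathbf{K}\,\mathbf{T}\,\tilde{\mathbf{p}}_i\right) - \mathbf{u}_i \right\|_2^2, \qquad \pi\!\left([x, y, z]^\top\right) = \left[\tfrac{x}{z}, \tfrac{y}{z}\right]^\top
$$

In this form $\mathbf{T}$ is used as the $3 \times 4$ matrix $[\mathbf{R} \mid \mathbf{t}]$. Wrapping a PnP solver such as EPnP in RANSAC fits the pose on small subsets of correspondences and keeps the solution with the most inliers, so that wrong pixel-point matches do not dominate the sum.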

Paper Structure

This paper contains 26 sections, 17 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Overall flowchart of our method. In the data pre-processing stage, we obtain the pre-processed point cloud $P_{3D}$ and the image $I_{2D}$, as well as the edge points $kp_{3D}$ of the point cloud and the edge pixels $kp_{2D}$ of the image. After feature extraction, we acquire edge features $\mathbf{d}_{3D}$ and $\mathbf{d}_{2D}$. In the optimal matching process, correspondences between 2D and 3D are found using the partial assignment matrix $\mathbf{P}$. Finally, the transformation matrix $\mathbf{T}$ is estimated using EPnP.
  • Figure 2: The left side of the figure shows the point cloud generated by the LiDAR scan and the spatial coordinate system of the point cloud data. The right side shows the 2D pixels in the camera coordinate system. $\mathbf{T}$ represents a transformation matrix in $\text{SE}(3)$, and $\mathbf{K}$ denotes the camera's intrinsic parameters.
  • Figure 3: Visual comparison of various edge extraction methods on the KITTI Odometry dataset, shown alongside the projection of 3D edge points onto the image. (a) Edge extraction using the Canny operator with threshold values of (50, 150) yields 63406 edge pixels. (b) Edge extraction using the Sobel operator with threshold values of (0, 150) results in 61176 edge pixels. (c) Edge extraction using the LSD algorithm produces 20933 edge pixels. (d) Visualization of depth-discontinuous points obtained by projecting pre-processed point clouds onto the image plane.
  • Figure 4: Demonstration of the selection of depth-discontinuous points and visualization of their projection onto the image plane. (a) Top view of a single scan line from the point cloud collected by a 64-line LiDAR in the KITTI dataset, showing the LiDAR position and scanning direction, with depth-discontinuous points highlighted. Positions where the point radius decreases are indicated by red arrows, while positions where the radius increases are indicated by green arrows. (b) Visualization of the projection of selected depth-discontinuous points from the point cloud onto the image plane. (c) Visualization of the projection of selected reflectance-discontinuous points from the point cloud onto the image plane.
  • Figure 5: EdgeRegNet Structure. EdgeRegNet consists of two main components: the feature extraction module on the left and the Attention-Based Feature Exchange Block on the right. The feature extraction module includes an image feature extraction branch and a point cloud feature extraction branch. These branches extract high-dimensional features $\mathbf{F}_{2D}$ and $\mathbf{F}_{3D}$ from the preprocessed data, representing 2D and 3D feature points, respectively. The extracted features are then fed into the Attention-Based Feature Exchange Block after position embedding, producing updated features $\mathbf{d}_{2D}$ and $\mathbf{d}_{3D}$. $N$ in the diagram denotes the number of Attention-Based Feature Exchange Blocks applied consecutively.
  • ...and 2 more figures
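
The captions of Figures 2 and 4 describe two geometric operations: selecting depth-discontinuous points where the radius jumps along a LiDAR scan line, and projecting 3D points onto the image plane with $\mathbf{T}$ and $\mathbf{K}$. The following is a minimal Python sketch of both ideas, not the authors' implementation. The function names, the `jump` threshold, the choice to keep the nearer point at each jump, and the zero-skew pinhole model are all assumptions made for illustration.

```python
import math

def depth_discontinuities(scan_line, jump=1.0):
    """Return indices of points adjacent to a range jump on one scan line.

    scan_line: list of (x, y, z) points ordered along the scanning direction.
    jump: hypothetical threshold in metres (an assumption, not from the paper).
    """
    radii = [math.sqrt(x * x + y * y + z * z) for x, y, z in scan_line]
    edges = set()
    for i in range(1, len(radii)):
        diff = radii[i] - radii[i - 1]
        if diff < -jump:
            edges.add(i)       # radius decreases: keep the nearer (current) point
        elif diff > jump:
            edges.add(i - 1)   # radius increases: keep the nearer (previous) point
    return sorted(edges)

def project(point, R, t, K):
    """Project a 3D point to pixel coordinates (zero-skew pinhole model assumed).

    R (3x3) and t (3,) are the rotation and translation parts of T;
    K is the 3x3 intrinsic matrix. Returns None for points behind the camera.
    """
    # Camera-frame coordinates: p_cam = R * p + t
    pc = [sum(R[r][c] * point[c] for c in range(3)) + t[r] for r in range(3)]
    if pc[2] <= 0:
        return None
    u = K[0][0] * pc[0] / pc[2] + K[0][2]
    v = K[1][1] * pc[1] / pc[2] + K[1][2]
    return (u, v)

# Toy scan line: a wall at 10 m with a closer object at 5 m in the middle.
# For simplicity the LiDAR and camera frames coincide here (identity T).
line = [(-1.0, 0.0, 10.0), (-0.5, 0.0, 10.0), (0.0, 0.0, 5.0),
        (0.25, 0.0, 5.0), (1.0, 0.0, 10.0), (1.5, 0.0, 10.0)]
R_id = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t_zero = [0.0, 0.0, 0.0]
K_toy = [[700.0, 0.0, 600.0], [0.0, 700.0, 180.0], [0.0, 0.0, 1.0]]  # made-up intrinsics

edge_idx = depth_discontinuities(line)                              # → [2, 3]
edge_px = [project(line[i], R_id, t_zero, K_toy) for i in edge_idx]
# → [(600.0, 180.0), (635.0, 180.0)]
```

In the toy example the two selected points are the boundary points of the closer object, which is the kind of 3D edge point that could be matched against 2D edge pixels. A real pipeline would run the selection per scan line over the full point cloud and would also discard projections that fall outside the image bounds.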