DVLO: Deep Visual-LiDAR Odometry with Local-to-Global Feature Fusion and Bi-Directional Structure Alignment

Jiuming Liu, Dong Zhuo, Zhiheng Feng, Siting Zhu, Chensheng Peng, Zhe Liu, Hesheng Wang

TL;DR

DVLO is a visual-LiDAR odometry network that estimates the relative pose $(q,t)$ between consecutive frames, where $q \in \mathbb{R}^{4}$ is a rotation quaternion and $t \in \mathbb{R}^{3}$ is a translation vector. It fuses the two modalities from local to global scale with bi-directional structure alignment. A clustering-based Local Fuser projects LiDAR points onto the image plane as cluster centers and treats image pixels as pseudo points (image-to-point alignment), which yields fine-grained local correspondences. A Global Fuser then projects points onto cylindrical pseudo images (point-to-image alignment) for adaptive global fusion. The pose is regressed from a cost-volume embedding and refined iteratively. The method reports state-of-the-art accuracy on the KITTI odometry benchmark, generalizes to scene-flow estimation on FlyingThings3D, and runs in real time on GPUs. The fusion design is efficient and may carry over to other multi-modal perception and navigation tasks.
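To make the pose parameterization concrete, the following is a minimal sketch of how a pair $(q,t)$ can be applied to a 3D point. It assumes the quaternion is ordered $(w, x, y, z)$, which the summary does not state, and the function names `quat_to_rot` and `apply_pose` are illustrative rather than taken from the DVLO code. Plain Python lists are used for clarity.

```python
import math

def quat_to_rot(q):
    """Convert a quaternion (w, x, y, z) to a 3x3 rotation matrix.

    The (w, x, y, z) ordering is an assumption made for this sketch.
    The quaternion is normalized first, so any non-zero input works.
    """
    w, x, y, z = q
    n = math.sqrt(w * w + x * x + y * y + z * z)
    w, x, y, z = w / n, x / n, y / n, z / n
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]

def apply_pose(q, t, p):
    """Apply the rigid transform p' = R(q) p + t to a single 3D point."""
    R = quat_to_rot(q)
    return [sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3)]
```

For example, a 90-degree rotation about the z-axis followed by a one-unit shift along x maps the point (1, 0, 0) to approximately (1, 1, 0).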

Abstract

Visual and LiDAR data are highly complementary: images provide fine-grained texture, while point clouds provide rich geometric information. However, effective visual-LiDAR fusion remains challenging, mainly because the data structures of the two modalities are intrinsically inconsistent: image pixels are regular and dense, whereas LiDAR points are unordered and sparse. To address this problem, we propose a local-to-global fusion network (DVLO) with bi-directional structure alignment. To obtain locally fused features, we project points onto the image plane as cluster centers and cluster image pixels around each center. Image pixels are pre-organized as pseudo points for image-to-point structure alignment. We then convert points to pseudo images by cylindrical projection (point-to-image structure alignment) and perform adaptive global feature fusion between point features and locally fused features. Our method achieves state-of-the-art performance on the KITTI odometry and FlyingThings3D scene flow datasets compared to both single-modal and multi-modal methods. Code is released at https://github.com/IRMVLab/DVLO.
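The two structure-alignment projections mentioned in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the released implementation: `project_to_image` assumes a standard pinhole model with a 3x3 intrinsic matrix `K` (no skew) and a 4x4 LiDAR-to-camera extrinsic `T`, and `cylindrical_project` assumes a common range-image convention in which columns index azimuth and rows index elevation between `fov_up` and `fov_down` (in radians). All function and parameter names are hypothetical.

```python
import math

def project_to_image(point, K, T):
    """Project a LiDAR point to pixel coordinates (u, v) with a pinhole model.

    T is an assumed 4x4 LiDAR-to-camera transform, K an assumed 3x3 intrinsic
    matrix without skew. Returns None for points at or behind the camera plane.
    """
    x, y, z = point
    pc = [T[i][0] * x + T[i][1] * y + T[i][2] * z + T[i][3] for i in range(3)]
    if pc[2] <= 0:
        return None
    u = K[0][0] * pc[0] / pc[2] + K[0][2]
    v = K[1][1] * pc[1] / pc[2] + K[1][2]
    return (u, v)

def cylindrical_project(point, width, height, fov_up, fov_down):
    """Map a LiDAR point to (row, col) in a height x width pseudo image.

    Columns follow the azimuth angle and rows follow the elevation angle.
    Returns None for a point at the sensor origin, where angles are undefined.
    """
    x, y, z = point
    r = math.sqrt(x * x + y * y + z * z)
    if r == 0.0:
        return None
    yaw = math.atan2(y, x)        # azimuth in [-pi, pi]
    pitch = math.asin(z / r)      # elevation in [-pi/2, pi/2]
    col = int(0.5 * (1.0 - yaw / math.pi) * width)
    row = int((fov_up - pitch) / (fov_up - fov_down) * height)
    # Clamp so that points on or outside the field-of-view border stay in range.
    col = min(max(col, 0), width - 1)
    row = min(max(row, 0), height - 1)
    return (row, col)
```

In this sketch, the pixel locations returned by `project_to_image` would play the role of cluster centers on the image plane, while `cylindrical_project` gives each point a cell in a dense, image-like grid.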

Paper Structure (18 sections, 8 equations, 7 figures, 7 tables)


Figures (7)

  • Figure 1: Different fusion strategies for images and points. Most previous works perform fusion either only globally [valente2019deep] or only locally [zhuoins20234drvo]. Our DVLO uses a local-to-global fusion strategy that facilitates the interaction of global information while preserving fine-grained local information. Furthermore, a bi-directional structure alignment is designed to maximize the inter-modality complementarity.
  • Figure 2: The pipeline of our proposed DVLO. We propose a novel Local-to-Global (LoGo) fusion module, which consists of a clustering-based Local Fuser and an adaptive Global Fuser. The pose is initially regressed from the cost volume of the coarsest fused features and then refined iteratively from fused features in shallower layers.
  • Figure 3: Our Local-to-Global (LoGo) Fusion module. Using the coordinate-system transformation matrix, we project points onto the image plane to serve as cluster centers, and we convert the image into a set of pseudo points. We then locally aggregate pseudo-point features according to their similarity to each cluster center.
  • Figure 4: Trajectories from our estimated poses. This figure shows both the 2D and 3D trajectories of our network alongside the ground truth on the KITTI dataset.
  • Figure 5: Trajectory results of LOAM and our method on KITTI sequence 07, with ground truth. Our method outperforms LOAM both with and without mapping.
  • ...and 2 more figures
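
The similarity-based local aggregation described in the Figure 3 caption can be illustrated with a small sketch. It is a simplified stand-in, not the Local Fuser itself: it assumes cosine similarity, a hard assignment of each pseudo point to its most similar cluster center, and a weighted mean that also includes the center's own feature. The names `cosine` and `aggregate_local` are hypothetical, and the spatial neighborhood restriction used in practice is omitted.

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors (small epsilon avoids 0/0)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb + 1e-12)

def aggregate_local(centers, pseudo_points):
    """Fuse pseudo-point features into their most similar cluster center.

    Each pseudo point is assigned to the center with the highest cosine
    similarity and contributes with that similarity as its weight (negative
    similarities are clipped to zero). The center feature itself enters with
    weight 1, so every center produces an output even with no assigned points.
    """
    sums = [list(c) for c in centers]
    weights = [1.0] * len(centers)
    for feat in pseudo_points:
        sims = [cosine(c, feat) for c in centers]
        k = max(range(len(centers)), key=lambda i: sims[i])
        w = max(sims[k], 0.0)
        sums[k] = [s + w * f for s, f in zip(sums[k], feat)]
        weights[k] += w
    return [[s / weights[i] for s in sums[i]] for i in range(len(centers))]
```

With centers `[[1, 0], [0, 1]]` and pseudo points `[[2, 0], [0, 3], [1, 0]]`, the first center absorbs the two points aligned with it and the second absorbs the remaining one, giving fused features close to `[4/3, 0]` and `[0, 2]`.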