Learning Optical Flow and Scene Flow with Bidirectional Camera-LiDAR Fusion

Haisong Liu, Tao Lu, Yihui Xu, Jia Liu, Limin Wang

TL;DR

This work tackles the joint estimation of 2D optical flow and 3D scene flow from synchronized camera-LiDAR data. It introduces a bidirectional, multi-stage fusion framework powered by the learnable Bi-CLFM, with two instantiations: CamLiPWC (pyramidal coarse-to-fine) and CamLiRAFT (recurrent all-pairs), including a point-based LiDAR branch for geometric fidelity. The approach achieves state-of-the-art results on FlyingThings3D and KITTI, including a KITTI SF-all of 4.26% without rigid priors, and demonstrates strong generalization to non-rigid motion as shown on Sintel, while maintaining lower parameter counts and competitive latency. These findings highlight the value of end-to-end, bidirectional fusion for exploiting complementary cues from dense images and sparse LiDAR data in motion estimation.

Abstract

In this paper, we study the problem of jointly estimating optical flow and scene flow from synchronized 2D and 3D data. Previous methods either employ a complex pipeline that splits the joint task into independent stages, or fuse 2D and 3D information in an "early-fusion" or "late-fusion" manner. Such one-size-fits-all approaches face a dilemma: they fail either to fully utilize the characteristics of each modality or to maximize the inter-modality complementarity. To address the problem, we propose a novel end-to-end framework, which consists of 2D and 3D branches with multiple bidirectional fusion connections between them in specific layers. Unlike previous work, we apply a point-based 3D branch to extract the LiDAR features, as it preserves the geometric structure of point clouds. To fuse dense image features and sparse point features, we propose a learnable operator named bidirectional camera-LiDAR fusion module (Bi-CLFM). We instantiate two types of the bidirectional fusion pipeline, one based on the pyramidal coarse-to-fine architecture (dubbed CamLiPWC), and the other based on recurrent all-pairs field transforms (dubbed CamLiRAFT). On FlyingThings3D, both CamLiPWC and CamLiRAFT surpass all existing methods and achieve up to a 47.9% reduction in 3D end-point error relative to the best published result. Our best-performing model, CamLiRAFT, achieves an error of 4.26% on the KITTI Scene Flow benchmark, ranking 1st among all submissions with far fewer parameters. In addition, our methods generalize well and can handle non-rigid motion. Code is available at https://github.com/MCG-NJU/CamLiFlow.
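
The abstract reports its FlyingThings3D result as a relative reduction in 3D end-point error. As a minimal sketch of how such a number is conventionally computed, the snippet below assumes the usual definition of end-point error (mean Euclidean distance between predicted and ground-truth flow vectors). The function names `end_point_error` and `relative_reduction` are hypothetical and are not taken from the CamLiFlow repository; the benchmark's exact evaluation protocol (masking, units) is not reproduced here.

```python
import math


def end_point_error(pred, gt):
    """Mean Euclidean distance between predicted and ground-truth flow vectors.

    `pred` and `gt` are equal-length sequences of 3D flow vectors (x, y, z).
    """
    if len(pred) != len(gt) or not pred:
        raise ValueError("pred and gt must be non-empty and the same length")
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)


def relative_reduction(baseline_error, new_error):
    """Percentage by which `new_error` is lower than `baseline_error`."""
    return 100.0 * (baseline_error - new_error) / baseline_error
```

For example, with illustrative inputs (not values from the paper), a baseline error of 2.0 and a new error of 1.0 correspond to a 50% reduction.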

Paper Structure (22 sections, 14 equations, 13 figures, 13 tables)

Figures (13)

  • Figure 1: Leaderboard of the KITTI Scene Flow Benchmark. Marker size indicates model size. Models with unknown sizes and conventional approaches are marked as hollow. Our method outperforms all existing methods [menze2015osf, ren2017ssf, vogel2015prsm, behl2017isf, ma2019drisf, jiang2019sense, yang2020opticalexp, teed2021raft3d, yang2021rigidmask] with far fewer parameters.
  • Figure 2: Architectures for feature-level fusion. Different from previous methods which adopt an early/late fusion manner, we propose a multi-stage and bidirectional fusion pipeline.
  • Figure 3: Details of Bidirectional Camera-LiDAR Fusion Module (Bi-CLFM). Features from two different modalities are fused in a bidirectional way, so that both modalities can benefit each other. We detach the gradient from one branch to the other to prevent one modality from dominating.
  • Figure 4: Details of Learnable Interpolation. For each target pixel, we find the $k$ nearest points around it. A lightweight MLP followed by a sigmoid activation is used to weigh the neighboring features.
  • Figure 5: The loss and gradient scale of the two branches in CamLiPWC. The 2D gradients are $\sim$40x larger than 3D gradients and the gap does not shrink as the number of training epochs increases.
  • ...and 8 more figures
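
The learnable interpolation described in the Figure 4 caption (find the $k$ nearest points for each target pixel, then weigh their features with a lightweight MLP followed by a sigmoid) can be sketched roughly as follows. This is an illustrative sketch, not the authors' implementation. It assumes that the MLP scores each neighbor from its 2D offset to the pixel, that the weighted features are summed, and that the weights are random stand-ins for trained parameters. The names `make_mlp`, `score` and `interpolate` are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def make_mlp(hidden=8, seed=0):
    """Randomly initialised 2 -> hidden -> 1 MLP (stand-in for trained weights)."""
    rng = random.Random(seed)
    w1 = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(hidden)]
    b1 = [0.0] * hidden
    w2 = [rng.uniform(-1, 1) for _ in range(hidden)]
    b2 = 0.0
    return w1, b1, w2, b2


def score(mlp, offset):
    """Map a 2D pixel-to-point offset to a weight in (0, 1)."""
    w1, b1, w2, b2 = mlp
    # Hidden layer with ReLU, then a scalar output squashed by a sigmoid.
    hidden = [max(0.0, w[0] * offset[0] + w[1] * offset[1] + b)
              for w, b in zip(w1, b1)]
    return sigmoid(sum(wi * hi for wi, hi in zip(w2, hidden)) + b2)


def interpolate(pixel, points_uv, feats, mlp, k=3):
    """Interpolate a feature vector at `pixel` from sparse projected points.

    points_uv: 2D image-plane positions of the LiDAR points (assumed given).
    feats:     one feature vector per point.
    """
    # Indices of the k projected points nearest to the target pixel.
    nearest = sorted(range(len(points_uv)),
                     key=lambda i: math.dist(pixel, points_uv[i]))[:k]
    out = [0.0] * len(feats[0])
    for i in nearest:
        offset = (points_uv[i][0] - pixel[0], points_uv[i][1] - pixel[1])
        weight = score(mlp, offset)
        # Assumed aggregation: weighted sum over the k neighbors.
        for c, value in enumerate(feats[i]):
            out[c] += weight * value
    return out
```

Calling `interpolate((0.2, 0.2), points_uv, feats, make_mlp(), k=2)` returns one feature vector for that pixel; in a dense setting the same routine would be applied at every pixel so that the sparse point features become an image-shaped feature map.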