SF3D-RGB: Scene Flow Estimation from Monocular Camera and Sparse LiDAR

Rajai Alhimdiat, Ramy Battrawy, René Schuster, Didier Stricker, Wesam Ashour

TL;DR

SF3D-RGB is a deep learning architecture that estimates sparse scene flow from 2D monocular images and 3D point clouds. It outperforms single-modality methods and, on real-world datasets, achieves better scene flow accuracy with fewer parameters than other state-of-the-art fusion methods.

Abstract

Scene flow estimation is an important task in computer vision that supports the perception of dynamic changes in a scene. For robust scene flow, learning-based approaches have recently achieved impressive results using either image-based or LiDAR-based modalities. However, these methods have tended to rely on a single modality. To address this limitation, we present a deep learning architecture, SF3D-RGB, that enables sparse scene flow estimation using 2D monocular images and 3D point clouds (e.g., acquired by LiDAR) as inputs. Our architecture is an end-to-end model that first encodes information from each modality into features and fuses them together. The fused features then enhance a graph matching module, which computes a better and more robust mapping matrix to generate an initial scene flow. Finally, a residual scene flow module refines the initial scene flow. Our model is designed to strike a balance between accuracy and efficiency. Experiments show that our proposed method outperforms single-modality methods and achieves better scene flow accuracy on real-world datasets while using fewer parameters than other state-of-the-art fusion methods.
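The three stages named in the abstract (fusion, a mapping matrix yielding an initial flow, and residual refinement) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `fuse`, `soft_mapping`, `initial_flow`, and `refine` are hypothetical, fusion is assumed to be plain feature concatenation, and the mapping matrix is approximated by Sinkhorn-style normalisation of cosine similarities, a stand-in for the paper's learned graph matching module.

```python
import numpy as np


def fuse(point_feats, image_feats):
    """Assumed fusion: concatenate per-point LiDAR and image features."""
    return np.concatenate([point_feats, image_feats], axis=1)


def soft_mapping(feat_src, feat_tgt, temperature=0.1, n_iters=20, eps=1e-8):
    """Build a soft mapping matrix (N x M) from per-point features.

    Stand-in for a graph matching step: cosine similarity followed by
    alternating row/column normalisation (Sinkhorn-style).
    """
    a = feat_src / (np.linalg.norm(feat_src, axis=1, keepdims=True) + eps)
    b = feat_tgt / (np.linalg.norm(feat_tgt, axis=1, keepdims=True) + eps)
    kernel = np.exp((a @ b.T) / temperature)
    for _ in range(n_iters):
        kernel = kernel / kernel.sum(axis=1, keepdims=True)
        kernel = kernel / kernel.sum(axis=0, keepdims=True)
    # Final row normalisation so each source point's weights sum to 1.
    return kernel / kernel.sum(axis=1, keepdims=True)


def initial_flow(pts_src, pts_tgt, mapping):
    """Initial flow: soft-matched target position minus source position."""
    return mapping @ pts_tgt - pts_src


def refine(flow, residual):
    """Residual refinement: add a (normally learned) correction to the flow."""
    return flow + residual
```

In a trained model the features and the residual would come from learned networks; here they are supplied by the caller, so the sketch only shows how the pieces connect.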

Paper Structure (17 sections, 10 equations, 4 figures, 4 tables, 1 algorithm)

Figures (4)

  • Figure 1: SF3D-RGB consists of a Pointwise Feature Extraction (FE) module, a Feature Pyramid Network (FPN), a Fusion Module (FM), a Graph Matching Module (GM), and a Refinement Module (RF). We denote the feature extraction convolutions by $g(\cdot)$.
  • Figure 2: Pointwise Feature Extraction Module based on graph convolution.
  • Figure 3: Qualitative scene flow results of SF3D-RGB on the KITTId dataset, compared with LiDAR-only and early fusion baselines. LiDAR points are overlaid on the images to improve visualization clarity of the input scenes. Note that FLOT (Puy et al., 2020) uses only LiDAR as input, whereas both the early fusion baseline and our SF3D-RGB use LiDAR combined with a monocular camera as inputs. In the error maps, dark blue indicates lower error, dark red indicates higher error, and black regions indicate unavailable LiDAR points due to removal of ground points.
  • Figure 4: Qualitative comparison of SF3D-RGB and DeepLiDARFlow (Rishav et al., 2020) on the KITTId and lidarKITTI test datasets. Our SF3D-RGB results show strong scene flow accuracy across both datasets compared to DeepLiDARFlow. LiDAR points are overlaid on the images to improve visualization clarity of the input scenes. Both DeepLiDARFlow and our SF3D-RGB use LiDAR combined with a monocular camera as inputs. In the error maps, dark blue indicates lower error, dark red indicates higher error, and black regions indicate unavailable LiDAR points due to removal of ground points. Note that DeepLiDARFlow does not exclude ground points in its scene flow estimation.
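
The error maps in Figures 3 and 4 colour each LiDAR point by its flow error and leave removed ground points blank. A minimal sketch of that computation follows, assuming the error is the per-point Euclidean end-point error and that ground removal is supplied as a boolean mask. The name `endpoint_error` and the `valid` argument are illustrative and do not come from the paper.

```python
import numpy as np


def endpoint_error(flow_pred, flow_gt, valid=None):
    """Return per-point end-point error and its mean over valid points.

    flow_pred, flow_gt: (N, 3) arrays of predicted and ground-truth flow.
    valid: optional (N,) boolean mask; False marks points excluded from
           evaluation (e.g. removed ground points, an assumption here).
    """
    err = np.linalg.norm(flow_pred - flow_gt, axis=1)
    if valid is None:
        valid = np.ones(err.shape[0], dtype=bool)
    if not valid.any():
        raise ValueError("no valid points to evaluate")
    return err, float(err[valid].mean())
```

Points with `valid == False` still receive an error value in the returned array but do not contribute to the mean, which mirrors how masked points would be left out of a rendered error map.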