Bring Event into RGB and LiDAR: Hierarchical Visual-Motion Fusion for Scene Flow

Hanyu Zhou, Yi Chang, Zhiwei Shi, Luxin Yan

TL;DR

Scene flow estimation from RGB and LiDAR is hindered by the gap between the two modalities. The authors introduce an auxiliary event camera as a bridge and propose VisMoFlow, a hierarchical visual-motion fusion framework that fuses cross-modal knowledge in three homogeneous spaces: visual luminance, visual structure, and motion correlation. Key contributions include bridging RGB and LiDAR with event data, explicit homogeneous-space fusion via a cross-attention transformer and self-similarity clustering, and a multi-term loss ($L_{\text{pho}}$, $L_{\text{adv}}$, $L_{\text{consis}}$, $L_{\text{pse}}$, and $L^{kl}_{\text{corr}}$) that enables end-to-end optimization and achieves state-of-the-art results on day and night scenes. The method offers interpretable cross-modal fusion with potential for robust all-day multimodal perception in challenging environments.
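
The summary above names a cross-attention transformer as one of the fusion mechanisms but does not describe its layout. As a rough illustration only, the sketch below shows generic single-head cross-attention in NumPy, where features from one modality (here labeled RGB) query features from another (here labeled event) and the attended result is added back as a residual. The function name `cross_attention_fuse`, the projection matrices, the feature sizes, and the residual connection are all assumptions made for this example, not details taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fuse(query_feats, context_feats, w_q, w_k, w_v):
    """Generic single-head cross-attention (hypothetical helper, not the paper's module).

    query_feats:   (N, C) tokens of the modality being refined (e.g. RGB).
    context_feats: (M, C) tokens of the modality being attended to (e.g. event).
    w_q, w_k:      (C, d) projections; w_v: (C, C) so the residual add is shape-compatible.
    """
    q = query_feats @ w_q                      # (N, d)
    k = context_feats @ w_k                    # (M, d)
    v = context_feats @ w_v                    # (M, C)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)  # (N, M), rows sum to 1
    fused = query_feats + attn @ v             # residual fusion, (N, C)
    return fused, attn

# Toy usage with random features; all sizes are arbitrary.
rng = np.random.default_rng(0)
rgb_tokens = rng.normal(size=(6, 8))
event_tokens = rng.normal(size=(10, 8))
w_q = rng.normal(size=(8, 4))
w_k = rng.normal(size=(8, 4))
w_v = rng.normal(size=(8, 8))
fused, attn = cross_attention_fuse(rgb_tokens, event_tokens, w_q, w_k, w_v)
```

Each row of `attn` is a distribution over the context tokens, so every fused RGB token is its original feature plus a weighted mix of event features.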

Abstract

A single RGB camera or LiDAR is the mainstream sensor for the challenging task of scene flow, which relies heavily on visual features to match motion features. Compared with single-modality approaches, existing methods adopt a fusion strategy that directly fuses cross-modal complementary knowledge in motion space. However, these direct fusion methods may suffer from a modality gap caused by the intrinsically heterogeneous visual nature of RGB and LiDAR, which degrades motion features. We discover that event data is homogeneous with both RGB and LiDAR in visual and motion spaces. In this work, we introduce events as a bridge between RGB and LiDAR and propose a novel hierarchical visual-motion fusion framework for scene flow, which explores a homogeneous space to fuse cross-modal complementary knowledge in a physically interpretable way. In visual fusion, we find that events are complementary to RGB in luminance space (relative vs. absolute) for high dynamic imaging, and complementary to LiDAR in scene structure space (local boundary vs. global shape) for structure integrity. In motion fusion, we find that RGB, event and LiDAR are complementary in correlation space (spatially dense, temporally dense and spatiotemporally sparse, respectively), which motivates us to fuse their motion correlations for motion continuity. The proposed hierarchical fusion explicitly fuses multimodal knowledge to progressively improve scene flow from visual space to motion space. Extensive experiments verify the superiority of the proposed method.

Paper Structure

This paper contains 14 sections, 14 equations, 9 figures, 5 tables.

Figures (9)

  • Figure 1: Illustration of the main idea. A large modality gap exists because of the intrinsically heterogeneous visual nature of RGB and LiDAR, which degrades the motion features. We discover that event data is homogeneous with both RGB and LiDAR in visual and motion spaces. In this work, we introduce events as a bridge between RGB and LiDAR and propose a novel hierarchical visual-motion fusion framework for scene flow, which explores a homogeneous feature space to explicitly fuse cross-modal complementary knowledge in a physically interpretable way.
  • Figure 2: The VisMoFlow architecture mainly consists of visual luminance fusion, visual structure fusion and motion correlation fusion. In visual luminance fusion, we fuse the relative luminance of event and the absolute luminance of RGB for high dynamic imaging. In visual structure fusion, we fuse the local boundary structure of event and the global shape structure of LiDAR for structure integrity. In motion correlation fusion, we fuse the spatiotemporal complementary correlation knowledge of the three modalities for 3D motion continuity.
  • Figure 3: Clustering feature distribution of event and LiDAR. Event and LiDAR share the same structure manifold, where the boundary distribution of event is continuous while the shape distribution of LiDAR is truncated. This motivates us to take the structure as a homogeneous space to fuse the boundary-shape knowledge.
  • Figure 4: Correlation distributions of RGB, event and LiDAR. The three modalities have similar distributions in x- and y-axis correlation, while z-axis correlation is unique to LiDAR. RGB correlation is spatially dense, event correlation is temporally dense, and LiDAR correlation is spatiotemporally sparse. This inspires us to build a homogeneous correlation space to fuse the complementary motion knowledge.
  • Figure 5: Visual comparison of scene flows on synthetic Event-KITTI dataset.
  • ...and 4 more figures
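
The Figure 4 caption compares correlation distributions across modalities, and the loss list includes a term named $L^{kl}_{\text{corr}}$, but this page does not define either precisely. The sketch below is one plausible reading, offered only as an illustration: correlation is taken as a scaled dot product between features of two consecutive frames, each row is normalized into a distribution with a softmax, and two modalities are compared with a KL divergence. The function names, the softmax normalization, and the feature shapes are assumptions for this example and should not be read as the paper's formulation.

```python
import numpy as np

def correlation(feat_t0, feat_t1):
    # All-pairs scaled dot-product correlation between two frames:
    # (N, C) x (M, C) -> (N, M).
    return feat_t0 @ feat_t1.T / np.sqrt(feat_t0.shape[-1])

def to_distribution(corr):
    # Turn each row of a correlation matrix into a probability distribution
    # (softmax normalization is an assumption of this sketch).
    z = corr - corr.max(axis=-1, keepdims=True)
    p = np.exp(z)
    return p / p.sum(axis=-1, keepdims=True)

def kl_divergence(p, q, eps=1e-12):
    # Mean over rows of KL(p || q); eps guards against log(0).
    return float(np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)))

# Toy usage: random stand-ins for per-modality features at times t and t+1.
rng = np.random.default_rng(0)
rgb_t0, rgb_t1 = rng.normal(size=(5, 8)), rng.normal(size=(7, 8))
evt_t0, evt_t1 = rng.normal(size=(5, 8)), rng.normal(size=(7, 8))

p_rgb = to_distribution(correlation(rgb_t0, rgb_t1))
p_evt = to_distribution(correlation(evt_t0, evt_t1))

kl_same = kl_divergence(p_rgb, p_rgb)   # identical distributions
kl_cross = kl_divergence(p_rgb, p_evt)  # different modalities
```

A divergence of this kind would be small when two modalities agree on where each location moves and larger when they disagree, which is the sort of signal a correlation-alignment term could use.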