
Bridge Frame and Event: Common Spatiotemporal Fusion for High-Dynamic Scene Optical Flow

Hanyu Zhou, Haonan Wang, Haoyue Liu, Yuxing Duan, Yi Chang, Luxin Yan

TL;DR

This work tackles high-dynamic scene optical flow, where frame-based measurements suffer spatial blur and temporal discontinuity due to large displacements. It introduces ComST-Flow, which builds a common spatiotemporal gradient space as an intermediate bridge to align frame and event modalities and then performs boundary-guided motion fusion via two modules: visual boundary localization and motion correlation fusion, enhanced by a cross-modal transformer. The method explicitly enforces cross-modal alignment through losses that couple gradient similarity and boundary templates, and it demonstrates improved dense and continuous flow on synthetic Event-KITTI and real DSEC data, with ablations validating the contribution of each component. The approach offers interpretable, robust fusion for high-dynamic scenes and provides a pixel-aligned frame-event dataset for generalization studies, with potential impact on autonomous navigation and robotics applications where dynamic environments are common.

Abstract

High-dynamic scene optical flow is a challenging task: large displacements in frame imaging cause spatial blur and temporally discontinuous motion, which degrade the spatiotemporal features used for optical flow. Existing methods typically introduce an event camera and directly fuse the spatiotemporal features of the two modalities. However, this direct fusion is ineffective, because the heterogeneous data representations of the frame and event modalities leave a large gap between them. To address this issue, we explore a common latent space as an intermediate bridge to mitigate the modality gap. In this work, we propose a novel common spatiotemporal fusion between frame and event modalities for high-dynamic scene optical flow, comprising visual boundary localization and motion correlation fusion. Specifically, in visual boundary localization, we find that the frame and event modalities share similar spatiotemporal gradients, whose similarity distribution is consistent with the extracted boundary distribution. This motivates us to design a common spatiotemporal gradient to constrain the reference boundary localization. In motion correlation fusion, we find that frame-based motion has spatially dense but temporally discontinuous correlation, while event-based motion has spatially sparse but temporally continuous correlation. This inspires us to use the reference boundary to guide the fusion of complementary motion knowledge between the two modalities. Moreover, common spatiotemporal fusion not only relieves the cross-modal feature discrepancy, but also makes the fusion process interpretable for dense and continuous optical flow. Extensive experiments have been performed to verify the superiority of the proposed method.

Paper Structure

This paper contains 13 sections, 11 equations, 8 figures, 6 tables.

Figures (8)

  • Figure 1: Illustration of the problem and the idea. High-dynamic targets in world space cause degraded motion with spatial blur and temporal discontinuity in image space, resulting in the discretization of spatiotemporal features in feature space. Direct spatiotemporal fusion introduces an event camera to assist the frame camera and directly fuses their spatiotemporal features. However, this direct fusion suffers from feature misalignment, because the heterogeneous data representations of the frame and event modalities leave a large gap between them. In this work, we explore a common space to bridge the gap, thus guiding the frame-event spatiotemporal feature fusion.
  • Figure 2: The architecture of ComST-Flow, which consists of visual boundary localization and motion correlation fusion. In visual boundary localization, we transform the frame images and the event stream into a common spatiotemporal gradient space. We further constrain the gradient similarity and the extracted boundary similarity between the two modalities, locating the reference boundary points as the template. In motion correlation fusion, we introduce a cross-modal transformer to fuse the spatially dense correlation from the frame modality and the temporally continuous correlation from the event modality under the guidance of the boundary template, thus achieving dense and continuous optical flow.
  • Figure 3: Similarities of the spatiotemporal gradient and the boundary between the frame and event modalities. We use the Euclidean distance to calculate the frame-event spatiotemporal gradient similarity and the boundary similarity. The distributions of the two similarities are consistent, which motivates us to take the spatiotemporal gradient as the common space to constrain the localization of boundary points.
  • Figure 4: Correlation distributions of frame and event. Frame-based correlation features are spatially dense along the x- and y-axes, while event-based correlation features are temporally continuous along the t-axis. This inspires us to fuse the complementary spatiotemporal correlation between the two modalities for dense and continuous optical flow.
  • Figure 5: Visual comparison of optical flows on the real DSEC dataset with slow and fast motion.
  • ...and 3 more figures
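
The Figure 3 caption compares frame and event spatiotemporal gradients using the Euclidean distance. The following is a minimal sketch of that kind of comparison, not the authors' implementation. It assumes both modalities have already been rasterized into volumes of shape (T, H, W), for example a frame stack and an event voxel grid. The function names, the finite-difference gradient, and the global normalization step are all illustrative assumptions.

```python
import numpy as np


def spatiotemporal_gradient(volume):
    """Finite-difference gradients of a (T, H, W) volume along t, y, x.

    Returns an array of shape (3, T, H, W). Hypothetical helper: the
    source does not specify how the gradient is computed.
    """
    g_t, g_y, g_x = np.gradient(np.asarray(volume, dtype=np.float64))
    return np.stack([g_t, g_y, g_x])


def gradient_distance(frame_volume, event_volume, eps=1e-8):
    """Per-voxel Euclidean distance between the two gradient fields.

    Each field is divided by its overall 2-norm first (an assumption),
    so that the differing units of frame intensity and event counts do
    not dominate the distance. Smaller values mean more similar gradients.
    """
    g_f = spatiotemporal_gradient(frame_volume)
    g_e = spatiotemporal_gradient(event_volume)
    g_f = g_f / (np.linalg.norm(g_f) + eps)
    g_e = g_e / (np.linalg.norm(g_e) + eps)
    # Distance over the (t, y, x) gradient components at every voxel.
    return np.linalg.norm(g_f - g_e, axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames = rng.random((4, 8, 8))   # stand-in for a frame stack
    events = rng.random((4, 8, 8))   # stand-in for an event voxel grid
    dist = gradient_distance(frames, events)
    print(dist.shape, float(dist.mean()))
```

A distance map of this kind could be thresholded to pick candidate boundary points where the two modalities agree. How the paper turns the similarity into a boundary template is not described in this excerpt, so that step is left out.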