Injecting Frame-Event Complementary Fusion into Diffusion for Optical Flow in Challenging Scenes

Haonan Wang, Hanyu Zhou, Haoyue Liu, Luxin Yan

TL;DR

This work tackles optical flow estimation in challenging high-speed and low-light scenes where frame-based appearance is rich but boundary information is incomplete, and event streams provide dense boundaries with sparse appearance. It introduces Diff-ABFlow, a diffusion-based framework that fuses frame and event cues through an Attention-ABF module and refines flow via a Multi-Condition Iterative Denoising Decoder (MC-IDD) comprising TVM-MCA and MGDD, effectively modeling the denoising process conditioned on time, visuals, and motion. The approach demonstrates robust performance and strong generalization on synthetic and real degraded datasets, outperforming frame-only, event-only, and prior dual-modal methods, with ablations showing the critical value of the fusion module and the diffusion backbone. The work suggests that combining frame-event complementarity with diffusion-based denoising yields substantial gains in robustness and accuracy, potentially benefiting other perception tasks such as depth estimation and semantic segmentation.

Abstract

Optical flow estimation has achieved promising results in conventional scenes but faces challenges in high-speed and low-light scenes, which suffer from motion blur and insufficient illumination. These conditions weaken texture and amplify noise, deteriorating the appearance saturation and boundary completeness of frame cameras, both of which are necessary for motion feature matching. In degraded scenes, the frame camera provides dense appearance saturation but sparse boundary completeness due to its long imaging time and low dynamic range. In contrast, the event camera offers sparse appearance saturation, while its short imaging time and high dynamic range give rise to dense boundary completeness. Existing methods typically use feature fusion or domain adaptation to introduce event data and improve boundary completeness. However, the appearance features remain deteriorated, which severely affects both the widely adopted discriminative models, which learn the mapping from visual features to motion fields, and generative models, which generate motion fields from given visual features. We therefore introduce diffusion models, which learn the mapping from noisy flow to clear flow and are not affected by the deteriorated visual features. On this basis, we propose Diff-ABFlow, a novel optical flow estimation framework based on diffusion models with frame-event appearance-boundary fusion.

Paper Structure

This paper contains 32 sections, 9 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: Illustration of problem and idea. Motion blur in high-speed scenes and insufficient illumination in low-light scenes reduce the boundary completeness of frame images, resulting in unclear boundaries in the optical flow. In this work, we explore the appearance-boundary complementarity of frame and event to guide the fusion of these two modalities. In addition, we introduce diffusion models to reformulate optical flow estimation as a denoising process from noisy optical flow to clear optical flow, conditioned on fused visual features.
  • Figure 2: Overall framework of Diff-ABFlow. Diff-ABFlow mainly contains two parts: Attention-ABF for feature fusion and MC-IDD for denoising. In Attention-ABF, we utilize the appearance-boundary complementarity to fuse frame and event. In MC-IDD, we first integrate the time embedding, visual feature, and motion feature in the TVM-MCA module, which is based on a multi-way cross-attention mechanism. Then, in MGDD, we input the comprehensive feature and the optical flow of the current time step into multiple GRUs with memory slots for iterative denoising. We run MC-IDD repeatedly on the noisy optical flow to obtain the clear optical flow.
  • Figure 3: Appearance-boundary feature distribution of frame and event in high-speed and low-light scenes. We use K-means clustering to analyze the distribution of appearance and boundary features from frame and event features. The frame image has dense appearance saturation but sparse boundary completeness due to the motion blur of high-speed scenes and the insufficient illumination of low-light scenes. In contrast, the event stream provides complete boundaries in such degraded scenes, while its appearance saturation is sparse. This motivates us to design a feature fusion module that fuses the two modalities using the appearance-boundary complementarity.
  • Figure 4: t-SNE of visual features and corresponding motion labels from three different models. When degraded features are input into these models, misclassifications occur in discriminative models and mis-samplings occur in traditional generative models, while diffusion models demonstrate strong robustness to degraded inputs. This motivates us to introduce the paradigm of diffusion models and design a denoising decoding module for optical flow estimation.
  • Figure 5: Visualization results on real high-speed and nighttime images of HS-DSEC and LL-DSEC.
  • ...and 2 more figures
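
The captions for Figures 1 and 2 describe optical flow estimation as repeated denoising of a noisy flow field, conditioned on a fused visual feature. The following is a minimal sketch of that loop only, not the paper's implementation. The cosine noise schedule, the deterministic DDIM-style update, and all function names (`cosine_alpha_bar`, `denoise_flow`, `oracle_denoiser`) are assumptions introduced for illustration; the sampler and schedule actually used by Diff-ABFlow are not specified in this section. The `oracle_denoiser` is a hypothetical placeholder for MC-IDD that simply returns the condition as the clean-flow estimate.

```python
import numpy as np


def cosine_alpha_bar(num_steps):
    """Cumulative signal level per step (assumed cosine schedule).

    Index 0 is the clean level (exactly 1.0); index num_steps is near pure noise.
    """
    t = np.linspace(0.0, 1.0, num_steps + 1)
    f = np.cos((t + 0.008) / 1.008 * np.pi / 2) ** 2
    return f / f[0]


def denoise_flow(noisy_flow, fused_feature, denoiser, alpha_bar):
    """Iteratively refine a noisy flow field, conditioned on a fused feature."""
    flow = noisy_flow
    num_steps = len(alpha_bar) - 1
    for t in range(num_steps, 0, -1):
        # The denoiser sees the current flow, the time step, and the condition.
        clean_estimate = denoiser(flow, t, fused_feature)
        # Noise implied by that estimate at the current noise level.
        implied_noise = (flow - np.sqrt(alpha_bar[t]) * clean_estimate) / np.sqrt(
            1.0 - alpha_bar[t]
        )
        # Deterministic DDIM-style step to the next, less noisy level.
        flow = (
            np.sqrt(alpha_bar[t - 1]) * clean_estimate
            + np.sqrt(1.0 - alpha_bar[t - 1]) * implied_noise
        )
    return flow


def oracle_denoiser(flow, t, fused_feature):
    """Hypothetical stand-in for MC-IDD: treats the condition as the clean flow."""
    return fused_feature


rng = np.random.default_rng(0)
target_flow = rng.uniform(-1.0, 1.0, size=(2, 4, 4))  # (u, v) on a 4x4 grid
start = rng.standard_normal(target_flow.shape)        # pure-noise initial flow
result = denoise_flow(start, target_flow, oracle_denoiser, cosine_alpha_bar(10))
print(np.abs(result - target_flow).max())
```

In a real model, the placeholder denoiser would be replaced by a learned network that combines the time embedding, visual feature, and motion feature, as the Figure 2 caption describes. The loop structure (estimate the clean flow, then step to a lower noise level) would stay the same under these assumptions.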