CMTA: Cross-Modal Temporal Alignment for Event-guided Video Deblurring

Taewoo Kim, Hoonhee Cho, Kuk-Jin Yoon

TL;DR

CMTA tackles motion blur in video by leveraging an event camera stream $\{\mathbb{E}_{k}\}$ and a sequence of blurred frames $\{B_{k}\}$ to estimate the latent sharp frame $S_t$. Two core modules, Cross-modal Recurrent Intra-frame Feature Enhancement (CRIFE) and Event-guided Cascaded Inter-frame Temporal Feature Alignment (ECITFA), fuse features via cross-modal attention and perform cascaded temporal alignment across multiple scales. The authors introduce the EVRB dataset, a real-world collection of blurred RGB videos, corresponding sharp frames, and aligned event data captured with a triple-axis hybrid camera system. Empirical results on GoPro, HighREV, and EVRB demonstrate state-of-the-art performance and the effectiveness of event-guided temporal alignment for challenging blur scenarios.

Abstract

Video deblurring aims to restore sharp frames from motion-blurred videos by gathering information from adjacent frames to compensate for the insufficient data in a single blurred frame. However, when consecutive frames are severely blurred, frame-based video deblurring methods often fail to find accurate temporal correspondence among neighboring frames, which degrades performance. To address this limitation, we solve the video deblurring task by leveraging an event camera with microsecond temporal resolution. To fully exploit the dense temporal resolution of the event camera, we propose two modules: 1) intra-frame feature enhancement, which operates within the exposure time of a single blurred frame and iteratively enhances cross-modality features in a recurrent manner to better utilize the rich temporal information of events; and 2) inter-frame temporal feature alignment, which gathers valuable long-range temporal information for target frames and aggregates sharp features by leveraging the advantages of events. In addition, we present a novel dataset composed of real-world blurred RGB videos, corresponding sharp videos, and event data. This dataset serves as a valuable resource for evaluating event-guided deblurring methods. Through extensive experiments on both synthetic and real-world deblurring datasets, we demonstrate that our proposed methods outperform state-of-the-art frame-based and event-based motion deblurring methods. The code and dataset are available at https://github.com/intelpro/CMTA.
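To make the idea of working "within the exposure time of a single blurred frame" concrete, here is a minimal sketch of splitting an event stream into ordered temporal slices. It is not the authors' implementation. The event tuple layout `(timestamp, x, y, polarity)`, the function name `bin_events`, and the polarity-sum representation are all assumptions made for illustration; the paper may use a different event representation.

```python
from typing import List, Sequence, Tuple

# Hypothetical event layout: (timestamp, x, y, polarity), polarity in {-1, +1}.
Event = Tuple[float, int, int, int]


def bin_events(
    events: Sequence[Event],
    t_start: float,
    t_end: float,
    num_bins: int,
    height: int,
    width: int,
) -> List[List[List[int]]]:
    """Sum event polarities per pixel in num_bins equal slices of [t_start, t_end).

    Returns a nested list indexed as grid[bin][y][x].
    """
    if t_end <= t_start:
        raise ValueError("t_end must be greater than t_start")
    if num_bins <= 0:
        raise ValueError("num_bins must be positive")

    grid = [
        [[0 for _ in range(width)] for _ in range(height)]
        for _ in range(num_bins)
    ]
    bin_width = (t_end - t_start) / num_bins

    for t, x, y, polarity in events:
        # Ignore events outside the exposure window or the sensor area.
        if not (t_start <= t < t_end):
            continue
        if not (0 <= x < width and 0 <= y < height):
            continue
        # Clamp guards against floating-point rounding at the upper edge.
        b = min(int((t - t_start) / bin_width), num_bins - 1)
        grid[b][y][x] += polarity

    return grid


if __name__ == "__main__":
    # Toy exposure window of 1000 microseconds on a 2x2 sensor.
    toy_events = [(0, 0, 0, 1), (400, 1, 0, -1), (600, 1, 0, 1), (990, 1, 1, 1)]
    for i, event_slice in enumerate(bin_events(toy_events, 0, 1000, 2, 2, 2)):
        print(i, event_slice)
```

A recurrent module could then visit the slices in temporal order, updating its features once per slice. How that update is actually computed is defined in the paper and not reproduced here.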

Paper Structure (18 sections, 14 equations, 6 figures, 6 tables)


Figures (6)

  • Figure 1: Illustration of the hybrid camera system used to build the real-world event-based video deblurring dataset. $S$ and $B$ denote the cameras that acquire sharp and blurred videos, respectively. (a): The triple-axis camera system used to capture real-world blur. (b): A diagram of our hybrid camera system. (c): Samples from our EVRB dataset with natural blur.
  • Figure 2: The overall framework of CMTA is divided into two main components: Cross-modal Recurrent Intra-frame Feature Enhancement (CRIFE) and Event-guided Cascaded Inter-frame Temporal Feature Alignment (ECITFA). $s$ is the scale factor for multi-scale features. For simplicity, the ECITFA module is illustrated for the case $P = 2$.
  • Figure 3: Illustration of Cross-modal Recurrent Intra-frame Feature Enhancement (CRIFE).
  • Figure 4: Overview of the Event-guided Cascaded Inter-frame Temporal Feature Alignment (ECITFA). The left figure illustrates temporal alignment for scale $s$. The key module for each alignment procedure, Cross-modal Temporal Feature Alignment (CTFA) at time $t$, is illustrated on the right of the figure. The CTFA module operates similarly for reference times $t-1$ and $t+1$ as well.
  • Figure 5: Visual comparison of deblurring results on the GoPro dataset. The qualitative results for other methods are those provided by their respective authors.
  • ...and 1 more figures