CMTA: Cross-Modal Temporal Alignment for Event-guided Video Deblurring

Taewoo Kim; Hoonhee Cho; Kuk-Jin Yoon

TL;DR

CMTA tackles motion blur in video by leveraging an event camera stream $\{\mathbb{E}_{k}\}$ and a sequence of blurred frames $\{B_{k}\}$ to estimate the latent sharp frame $S_t$. Two core modules, Cross-modal Recurrent Intra-frame Feature Enhancement (CRIFE) and Event-guided Cascaded Inter-frame Temporal Feature Alignment (ECITFA), fuse features via cross-modal attention and perform cascaded temporal alignment across multiple scales. The authors introduce the EVRB dataset, a real-world collection of blurred RGB videos, corresponding sharp frames, and aligned event data captured with a triple-axis hybrid camera system. Empirical results on GoPro, HighREV, and EVRB demonstrate state-of-the-art performance and the effectiveness of event-guided temporal alignment for challenging blur scenarios.
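To make the phrase "fuse features via cross-modal attention" concrete, here is a minimal sketch of scaled dot-product attention between two modalities. It is not the CRIFE module itself. It assumes that frame features act as queries and event features act as keys and values, and that the result is added back to the frame features as a residual; the summary above does not state either choice. The function name `cross_modal_attention` and the toy feature vectors are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(frame_feats, event_feats):
    """Attend from frame features (queries) to event features (keys/values).

    frame_feats: list of N vectors of dimension d (assumed query modality).
    event_feats: list of M vectors of dimension d (assumed key/value modality).
    Returns (fused, weights): fused is N x d, weights is N x M.
    """
    d = len(frame_feats[0])
    scale = 1.0 / math.sqrt(d)
    fused, weights = [], []
    for q in frame_feats:
        # Similarity of this frame feature to every event feature.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in event_feats]
        w = softmax(scores)
        # Weighted sum of event features (values).
        attended = [
            sum(w[j] * event_feats[j][c] for j in range(len(event_feats)))
            for c in range(d)
        ]
        # Residual fusion: an assumption made for illustration only.
        fused.append([qi + ai for qi, ai in zip(q, attended)])
        weights.append(w)
    return fused, weights


# Toy example: 2 frame tokens, 3 event tokens, feature dimension 2.
frame_feats = [[1.0, 0.0], [0.0, 1.0]]
event_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused, weights = cross_modal_attention(frame_feats, event_feats)
```

Each row of `weights` sums to 1, so every frame token receives a convex combination of event features; a learned version would add projection matrices for queries, keys, and values.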

Abstract

Video deblurring aims to improve the quality of motion-blurred videos by gathering information from adjacent frames to compensate for the insufficient data in a single blurred frame. However, under consecutive severe motion blur, frame-based video deblurring methods often fail to find accurate temporal correspondence among neighboring frames, which degrades performance. To address this limitation, we tackle video deblurring by leveraging an event camera with microsecond temporal resolution. To fully exploit the dense temporal resolution of the event camera, we propose two modules: 1) intra-frame feature enhancement, which operates within the exposure time of a single blurred frame and iteratively enhances cross-modality features in a recurrent manner to better utilize the rich temporal information of events; and 2) inter-frame temporal feature alignment, which gathers valuable long-range temporal information for target frames and aggregates sharp features by leveraging the advantages of events. In addition, we present a novel dataset composed of real-world blurred RGB videos, corresponding sharp videos, and event data, which serves as a valuable resource for evaluating event-guided deblurring methods. Extensive experiments on both synthetic and real-world deblurring datasets demonstrate that our proposed methods outperform state-of-the-art frame-based and event-based motion deblurring methods. The code and dataset are available at https://github.com/intelpro/CMTA.
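As an illustration of what "operating within the exposure time of a single blurred frame" can mean for event data, the sketch below splits the events that fall inside one exposure window into a few temporal bins and accumulates signed polarities per pixel. This voxel-grid style representation is a common choice in event-based vision and is assumed here; the abstract does not say which event representation CMTA uses. The function name `events_to_voxel_grid`, the `(t, x, y, p)` tuple layout, and the toy timestamps are all hypothetical.

```python
def events_to_voxel_grid(events, t_start, t_end, num_bins, height, width):
    """Accumulate events inside [t_start, t_end) into num_bins temporal slices.

    events: iterable of (t, x, y, p) with timestamp t in seconds, pixel
            coordinates (x, y), and polarity p (> 0 for ON, otherwise OFF).
    Returns a nested list indexed as grid[bin][y][x] holding signed counts.
    """
    grid = [[[0 for _ in range(width)] for _ in range(height)]
            for _ in range(num_bins)]
    duration = t_end - t_start
    for t, x, y, p in events:
        if not (t_start <= t < t_end):
            continue  # event lies outside this frame's exposure window
        # Map the timestamp to a temporal bin within the exposure window.
        b = min(int((t - t_start) / duration * num_bins), num_bins - 1)
        grid[b][y][x] += 1 if p > 0 else -1
    return grid


# Toy example: a 10 ms exposure split into 2 bins on a 2x2 sensor.
events = [
    (0.001, 0, 0, 1),   # first half, ON event at (x=0, y=0)
    (0.002, 1, 0, -1),  # first half, OFF event at (x=1, y=0)
    (0.009, 0, 0, 1),   # second half, ON event at (x=0, y=0)
    (0.020, 1, 1, 1),   # after the exposure ends, ignored
]
grid = events_to_voxel_grid(events, 0.0, 0.010, num_bins=2, height=2, width=2)
```

A recurrent module could then consume the slices `grid[0]`, `grid[1]`, and so on in temporal order, which is one plausible way to read the abstract's "recurrent manner" and not a description of the released code.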

Paper Structure (18 sections, 14 equations, 6 figures, 6 tables)

This paper contains 18 sections, 14 equations, 6 figures, 6 tables.

Introduction
Related Works
Video Deblurring
Event-guided Motion Deblurring
Event-based Video Deblurring Dataset for Real-world Blur
Limitation of Synthetic Blur Dataset
Triple-axis Hybrid Camera System
Method
Overview
Cross-modal Recurrent Intra-frame Feature Enhancement
Event-guided Cascaded Inter-frame Temporal Feature Alignment
Decoder
Experiments
Datasets
Comparison on Synthetic Blur Datasets
...and 3 more sections

Figures (6)

Figure 1: Illustration of a hybrid camera system for real-world event-based video deblurring dataset. $S$ and $B$ denote the cameras for acquiring sharp and blur videos, respectively. (a): The triple-axis camera system to capture real-world blur. (b): A diagram of our hybrid camera system. (c): Samples from our EVRB dataset with natural blur.
Figure 2: The overall framework of CMTA is divided into two main components: Cross-modal Recurrent Intra-frame Feature Enhancement (CRIFE) and Event-guided Cascaded Inter-frame Temporal Feature Alignment (ECITFA). $s$ is the scale factor for multi-scale features. For simplicity, the ECITFA module is illustrated for the case $P=2$.
Figure 3: Illustration of Cross-modal Recurrent Intra-frame Feature Enhancement (CRIFE).
Figure 4: Overview of the Event-guided Cascaded Inter-frame Temporal Feature Alignment (ECITFA). The left figure illustrates temporal alignment for scale $s$. The key module for each alignment procedure, Cross-modal Temporal Feature Alignment (CTFA) at time $t$, is illustrated on the right of the figure. The CTFA module operates similarly for reference times $t-1$ and $t+1$ as well.
Figure 5: Visual comparison of deblurring results on the GoPro dataset. The qualitative results of other methods were taken from the results provided by the authors.
...and 1 more figure
