CMTA: Cross-Modal Temporal Alignment for Event-guided Video Deblurring
Taewoo Kim, Hoonhee Cho, Kuk-Jin Yoon
TL;DR
CMTA tackles motion blur in video by leveraging an event camera stream $\{\mathbb{E}_{k}\}$ and a sequence of blurred frames $\{B_{k}\}$ to estimate the latent sharp frame $S_t$. Two core modules, Cross-modal Recurrent Intra-frame Feature Enhancement (CRIFE) and Event-guided Cascaded Inter-frame Temporal Feature Alignment (ECITFA), fuse features via cross-modal attention and perform cascaded temporal alignment across multiple scales. The authors introduce the EVRB dataset, a real-world collection of blurred RGB videos, corresponding sharp frames, and aligned event data captured with a triple-axis hybrid camera system. Empirical results on GoPro, HighREV, and EVRB demonstrate state-of-the-art performance and the effectiveness of event-guided temporal alignment for challenging blur scenarios.
Abstract
Video deblurring restores motion-blurred videos by gathering information from adjacent frames to compensate for the insufficient information in a single blurred frame. However, when consecutive frames are all severely blurred, frame-based video deblurring methods often fail to find accurate temporal correspondences among neighboring frames, which degrades performance. To address this limitation, we approach video deblurring with an event camera that offers microsecond temporal resolution. To fully exploit this dense temporal resolution, we propose two modules: 1) intra-frame feature enhancement, which operates within the exposure time of a single blurred frame and iteratively enhances cross-modality features in a recurrent manner to better utilize the rich temporal information of events; and 2) inter-frame temporal feature alignment, which gathers valuable long-range temporal information for the target frame and aggregates sharp features with the guidance of events. In addition, we present a novel dataset composed of real-world blurred RGB videos, corresponding sharp videos, and event data. This dataset serves as a valuable resource for evaluating event-guided deblurring methods. Extensive experiments on both synthetic and real-world deblurring datasets demonstrate that our proposed method outperforms state-of-the-art frame-based and event-based motion deblurring methods. The code and dataset are available at https://github.com/intelpro/CMTA.
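To make the idea of working "within the exposure time of a single blurred frame" more concrete, the following is a minimal sketch of one common way to prepare events for such processing: splitting the events that fall inside an exposure window into a few temporal slices. This is an illustrative assumption, not the representation used by CMTA. The function name `bin_events`, the `(t, x, y, p)` column layout, and the number of slices are all hypothetical.

```python
import numpy as np

def bin_events(events, t_start, t_end, num_bins, height, width):
    """Accumulate signed event polarities into temporal slices.

    Hypothetical helper, not taken from the CMTA code base.
    events: float array of shape (N, 4) with columns (t, x, y, p),
            where p is +1 or -1.
    Returns an array of shape (num_bins, height, width).
    """
    voxel = np.zeros((num_bins, height, width), dtype=np.float64)
    t = events[:, 0]
    x = events[:, 1].astype(int)
    y = events[:, 2].astype(int)
    p = events[:, 3].astype(np.float64)

    # Keep only events inside the exposure window [t_start, t_end).
    keep = (t >= t_start) & (t < t_end)
    t, x, y, p = t[keep], x[keep], y[keep], p[keep]

    # Map each timestamp to a slice index in [0, num_bins - 1].
    idx = ((t - t_start) / (t_end - t_start) * num_bins).astype(int)
    idx = np.clip(idx, 0, num_bins - 1)

    # np.add.at accumulates correctly when the same cell is hit repeatedly.
    np.add.at(voxel, (idx, y, x), p)
    return voxel

# Usage: four events, one of which falls outside the exposure window.
events = np.array([
    [0.0, 0, 0, 1],
    [0.4, 1, 0, -1],
    [0.9, 1, 1, 1],
    [1.5, 0, 0, 1],
])
slices = bin_events(events, t_start=0.0, t_end=1.0, num_bins=2, height=2, width=2)
```

Under this assumed representation, a recurrent module could consume the slices one at a time in temporal order, which is one plausible reading of iterative intra-frame enhancement; the actual CRIFE design is described in the paper and code release.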
