A Unified Framework for Event-based Frame Interpolation with Ad-hoc Deblurring in the Wild

Lei Sun, Daniel Gehrig, Christos Sakaridis, Mathias Gehrig, Jingyun Liang, Peng Sun, Zhijie Xu, Kaiwei Wang, Luc Van Gool, Davide Scaramuzza

TL;DR

This work tackles robust event-based video frame interpolation under motion blur by developing REFID, a unified bidirectional recurrent network that performs ad-hoc deblurring to interpolate frames from sharp or blurry inputs using both frames and asynchronous events. The model fuses image and event features via a bidirectional event recurrent encoder and an Event-Guided Adaptive Channel Attention (EGACA) module, enabling accurate interpolation and deblurring in a single stage. To bridge synthetic-to-real gaps, the authors introduce a self-supervised fine-tuning framework with three losses (brightness increment, blur consistency, warp) and validate on a new HighREV dataset with high-resolution aligned events and RGB frames. Experiments show REFID achieves state-of-the-art performance on sharp and blurry frame interpolation and single-image deblurring, with strong generalization to real-world data thanks to SSL, highlighting practical impact for real-world event-based imaging systems.
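As a rough illustration of what a blur-consistency term could look like, the sketch below assumes the common blur formation model in which a blurry frame is the temporal average of the sharp frames inside its exposure window. The function names (`average_frames`, `blur_consistency_l1`) and the choice of an L1 penalty are assumptions made for illustration; the summary above does not give the exact formulation used in REFID.

```python
from typing import List

Frame = List[List[float]]  # grayscale image as rows of pixel intensities


def average_frames(frames: List[Frame]) -> Frame:
    """Re-blur: average N sharp frames pixel by pixel (assumed blur model)."""
    if not frames:
        raise ValueError("need at least one frame")
    height, width = len(frames[0]), len(frames[0][0])
    n = len(frames)
    return [
        [sum(f[y][x] for f in frames) / n for x in range(width)]
        for y in range(height)
    ]


def blur_consistency_l1(sharp_frames: List[Frame], blurry: Frame) -> float:
    """Mean absolute difference between the re-blurred prediction and the input.

    Hypothetical loss: small when the predicted sharp frames, once averaged,
    reproduce the observed blurry frame.
    """
    reblurred = average_frames(sharp_frames)
    height, width = len(blurry), len(blurry[0])
    total = sum(
        abs(reblurred[y][x] - blurry[y][x])
        for y in range(height)
        for x in range(width)
    )
    return total / (height * width)
```

A term of this kind needs no sharp ground truth, only the blurry input itself, which is why it suits self-supervised fine-tuning on real data.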

Abstract

Effective video frame interpolation hinges on the adept handling of motion in the input scene. Prior work acknowledges asynchronous event information for this, but often overlooks whether motion induces blur in the video, limiting its scope to sharp frame interpolation. We instead propose a unified framework for event-based frame interpolation that performs deblurring ad-hoc and thus works on both sharp and blurry input videos. Our model consists of a bidirectional recurrent network that incorporates the temporal dimension of interpolation and fuses information from the input frames and the events adaptively based on their temporal proximity. To improve generalization from synthetic data to real event cameras in the wild, we integrate a self-supervised framework with the proposed model. At the dataset level, we introduce a novel real-world high-resolution dataset with events and color videos named HighREV, which provides a challenging evaluation setting for the examined task. Extensive experiments show that our network consistently outperforms previous state-of-the-art methods on frame interpolation, single image deblurring, and the joint task of both. Experiments on domain transfer reveal that self-supervised training effectively mitigates the performance degradation observed when transitioning from synthetic data to real-world data. Code and datasets are available at https://github.com/AHupuJR/REFID.

Paper Structure (22 sections, 18 equations, 10 figures, 7 tables)

Figures (10)

  • Figure 1: Our unified framework for event-based sharp VFI (a) and blurry VFI (b). Red/blue dots: negative/positive events; Curly braces: exposure time range.
  • Figure 2: (a): The architecture of our Recurrent Event-based Frame Interpolation with ad-hoc Deblurring (REFID) network. The input of the image branch consists of two key frames and their corresponding events, and the event branch consumes sub-voxels of events recurrently. "EGACA": event-guided adaptive channel attention, "SConv": strided convolution, "TConv": transposed convolution. (b): The proposed bidirectional event recurrent (EVR) blocks. In each recurrent step, the events from the forward and backward direction are fed to the network. For notations, cf. \ref{eq:evr_block}.
  • Figure 3: Details of network inputs. Events within the exposure time $T$ and the blurry frame are unfolded into $N$ sharp images. Events are split into sub-intervals $\epsilon_i$, and two sub-intervals of events are used to compute 2-channel voxel grids $\mathbf{V}_i$. $\epsilon_i$ is also used to predict optical flow $\mathbf{u}_j$. For each sub-interval, events are warped with $\mathbf{u}_j$ to produce the image of warped events (IWE) $\mathbf{H}_j$.
  • Figure 4: An example of self-supervised single-image deblurring: a sharp video clip is restored from a blurry image and its corresponding events. From left to right: visualized events, image of warped events (IWE), and resulting sharp video clip. The IWE provides sharper edge information, while the events capture the blurry shape information.
  • Figure 5: The Event-Guided Adaptive Channel Attention module. The channel weights for the image branch are extracted from the event branch.
  • ...and 5 more figures
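
The input preparation described in the Figure 3 caption (splitting events into sub-intervals and building 2-channel voxel grids) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the event tuple layout `(t, x, y, polarity)`, the equal-duration split, the signed per-pixel accumulation, and the pairing of adjacent sub-intervals into the two channels are all hypothetical choices, as are the function names.

```python
from typing import List, Tuple

# Assumed event layout: (timestamp, x, y, polarity), polarity in {-1, +1}.
Event = Tuple[float, int, int, int]
Grid = List[List[int]]


def split_into_subintervals(
    events: List[Event], t_start: float, t_end: float, n: int
) -> List[List[Event]]:
    """Partition events in [t_start, t_end) into n equal-duration bins."""
    if n <= 0 or t_end <= t_start:
        raise ValueError("need n > 0 and t_end > t_start")
    bin_width = (t_end - t_start) / n
    bins: List[List[Event]] = [[] for _ in range(n)]
    for ev in events:
        t = ev[0]
        if t_start <= t < t_end:
            # Clamp to guard against floating-point rounding at the upper edge.
            idx = min(int((t - t_start) / bin_width), n - 1)
            bins[idx].append(ev)
    return bins


def accumulate(events: List[Event], height: int, width: int) -> Grid:
    """Sum signed polarities per pixel for one sub-interval."""
    grid = [[0] * width for _ in range(height)]
    for _, x, y, p in events:
        if 0 <= x < width and 0 <= y < height:
            grid[y][x] += p
    return grid


def two_channel_voxels(
    bins: List[List[Event]], height: int, width: int
) -> List[List[Grid]]:
    """Pair adjacent sub-intervals (i, i+1) into 2-channel grids (assumed pairing)."""
    return [
        [accumulate(bins[i], height, width), accumulate(bins[i + 1], height, width)]
        for i in range(len(bins) - 1)
    ]
```

For example, `two_channel_voxels(split_into_subintervals(events, 0.0, 0.4, 4), height, width)` yields three 2-channel grids, one per adjacent pair of sub-intervals, which a recurrent event branch could then consume one step at a time.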