Aggregating Nearest Sharp Features via Hybrid Transformers for Video Deblurring

Wei Shang, Dongwei Ren, Yi Yang, Wangmeng Zuo

TL;DR

This work tackles real-world video deblurring where sharp frames appear non-consecutively by proposing a framework that detects sharp frames and jointly aggregates information from temporal neighbors and nearest sharp frames using a hybrid Transformer. A blur-aware BiLSTM detector identifies sharp frames, while a window-based local Transformer fuses neighboring frames and a global Transformer matches and integrates nearest sharp textures across scales, with an optional event fusion module for event-driven deblurring. The approach achieves state-of-the-art or competitive results on GOPRO, REDS, BSD, and event datasets (CED, RBE), with improved generalization to real-world blur and robust performance when sharp-frame guidance is scarce. The method is efficient relative to large Transformer baselines and offers practical impact for real-world video restoration and potential extensions to related vision tasks such as video super-resolution and multi-frame interpolation.

Abstract

Video deblurring methods, which aim to recover consecutive sharp frames from a given blurry video, usually assume that the input video consists of consecutively blurry frames. However, in real-world scenarios captured by modern imaging devices, sharp frames are often interspersed within the video, providing temporally nearest sharp features that can aid in the restoration of blurry frames. In this work, we propose a video deblurring method that leverages both neighboring frames and existing sharp frames using hybrid Transformers for feature aggregation. Specifically, we first train a blur-aware detector to distinguish between sharp and blurry frames. Then, a window-based local Transformer is employed to exploit features from neighboring frames, where cross attention aggregates these features without explicit spatial alignment. To aggregate nearest sharp features from detected sharp frames, we utilize a global Transformer with multi-scale matching capability. Moreover, our method can easily be extended to event-driven video deblurring by incorporating an event fusion module into the global Transformer. Extensive experiments on benchmark datasets demonstrate that our proposed method outperforms state-of-the-art video deblurring methods as well as event-driven video deblurring methods in terms of quantitative metrics and visual quality. The source code and trained models are available at https://github.com/shangwei5/STGTN.
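
To make the idea of window-based cross attention concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a single attention head, no learned query/key/value projections, no window shifting, and feature maps whose height and width are divisible by the window size; the function names (`window_partition`, `window_merge`, `window_cross_attention`) are hypothetical. Queries come from the current frame and keys/values from a neighboring frame, so each position gathers neighbor features by similarity rather than by explicit warping.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def window_partition(feat, ws):
    """Split an (H, W, C) map into non-overlapping windows: (num_windows, ws*ws, C)."""
    h, w, c = feat.shape
    x = feat.reshape(h // ws, ws, w // ws, ws, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, ws * ws, c)


def window_merge(windows, ws, h, w):
    """Inverse of window_partition: (num_windows, ws*ws, C) back to (H, W, C)."""
    c = windows.shape[-1]
    x = windows.reshape(h // ws, w // ws, ws, ws, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(h, w, c)


def window_cross_attention(cur, nbr, ws=4):
    """Aggregate neighbor-frame features into the current frame, window by window.

    cur, nbr: (H, W, C) feature maps of the current and neighboring frame.
    Assumption: identity projections and one head, purely for illustration.
    """
    h, w, c = cur.shape
    q = window_partition(cur, ws)   # queries from the current frame
    kv = window_partition(nbr, ws)  # keys and values from the neighbor
    attn = softmax(q @ kv.transpose(0, 2, 1) / np.sqrt(c))  # (nW, ws*ws, ws*ws)
    out = attn @ kv                 # similarity-weighted sum of neighbor features
    return window_merge(out, ws, h, w)
```

Because each output vector is a convex combination of neighbor features inside the same window, motion smaller than the window can be absorbed without a separate alignment step; shifting the windows between blocks would extend this reach across window borders.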

Paper Structure (25 sections, 9 equations, 13 figures, 10 tables, 1 algorithm)

This paper contains 25 sections, 9 equations, 13 figures, 10 tables, 1 algorithm.

Figures (13)

  • Figure 1: Examples of sharp frames observed in real-world blurry videos \cite{zhong2023real}, where # denotes the frame number in a captured video. Nearest sharp features extracted from these sharp frames can be leveraged to enhance the restoration of the corresponding blurry frame.
  • Figure 2: The architecture of the blur-aware detector, which is designed to distinguish between sharp and blurry frames. A set of 5 adjacent frames, i.e., $\bm{B}_{i-2},\cdots,\bm{B}_{i},\cdots,\bm{B}_{i+2}$, is taken as input, and the current frame $\bm{B}_i$ is classified as either blurry or sharp. During training of the detector, the contrastive loss in Eq. \ref{eq: contrast} is introduced to enhance the discrimination between blurry and sharp frames.
  • Figure 3: The flowchart of $\mathcal{F}_\text{HybFormer}$ for restoring a blurry frame $\bm{B}_i$ with its neighboring frames $\bm{B}_{i-1}$ and $\bm{B}_{i+1}$, as well as its corresponding detected sharp frames $\bm{G}_{i}^- \text{ and } \bm{G}_{i}^+$. The restoration process in $\mathcal{F}_\text{HybFormer}$ involves four steps: extracting features from all frames using a three-scale CNN, fusing the features of adjacent frames by cross-attention shifted window Transformer (CSWT) blocks, aggregating nearest sharp features using the global Transformer, and finally reconstructing the latent frame $\hat{\bm{I}}_i$ with a decoder based on a three-scale CNN. Details regarding the window-based local Transformer and the global Transformer can be found in Figs. \ref{fig: cswt} and \ref{fig: msm}, respectively.
  • Figure 4: The architecture of CSWT Blocks $f_\text{cswt}$ for fusing adjacent frame features, where the third scale features $\bm{b}^3_i$ and $\bm{b}^3_{i+1}$ of frames $\bm{B}_i$ and $\bm{B}_{i+1}$ can be aggregated using cross-attention without explicit spatial alignment.
  • Figure 5: Global attention of nearest sharp feature $\bm{g}^{+,3}_i$ in rear frame $\bm{G}^+_i$ and feature $\bm{f}$ in the third scale. The global similarity for each patch is recorded in $\bm{S}^+$, enabling identification of the most relevant patches in the index matrix $\bm{X}^+$ along with their corresponding confidence map $\bm{M}^+$. These patches are then folded into a new feature representation denoted as $\bm{g}_i^{+,'}$. Similarly, the nearest sharp features $\bm{g}_i^{-,3}$ from the front frame $\bm{G}_i^-$ can be processed in the same way.
  • ...and 8 more figures
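
The global matching step described in the Figure 5 caption can be sketched as follows. This is a simplified NumPy illustration, not the released code: it assumes cosine similarity between 3×3 patches, a hard arg-max selection, and it gathers only the centre feature vector of each matched patch instead of folding overlapping patches. The helper names (`extract_patches`, `match_nearest_sharp`) are hypothetical; the variables `S`, `X`, and `M` mirror the similarity, index, and confidence maps named in the caption.

```python
import numpy as np


def extract_patches(feat, k=3):
    """Unfold a (C, H, W) map into (H*W, C*k*k) patches (stride 1, zero padding)."""
    c, h, w = feat.shape
    p = k // 2
    padded = np.pad(feat, ((0, 0), (p, p), (p, p)))
    patches = np.empty((h * w, c * k * k), dtype=feat.dtype)
    for y in range(h):
        for x in range(w):
            patches[y * w + x] = padded[:, y:y + k, x:x + k].ravel()
    return patches


def match_nearest_sharp(f, g, k=3):
    """Match every patch of feature f against all patches of sharp feature g.

    f, g: (C, H, W) feature maps at the same scale.
    Returns the re-assembled sharp feature, the index map X, and the confidence map M.
    """
    c, h, w = f.shape
    q = extract_patches(f, k)
    r = extract_patches(g, k)
    # Normalise so that the dot product is a cosine similarity.
    q = q / (np.linalg.norm(q, axis=1, keepdims=True) + 1e-8)
    r = r / (np.linalg.norm(r, axis=1, keepdims=True) + 1e-8)
    S = q @ r.T            # global similarity, shape (H*W, H*W)
    X = S.argmax(axis=1)   # index of the most relevant sharp patch per position
    M = S.max(axis=1)      # confidence of that match
    # Simplification: gather the centre vector of each matched patch.
    g_new = g.reshape(c, h * w)[:, X].reshape(c, h, w)
    return g_new, X.reshape(h, w), M.reshape(h, w)
```

The confidence map can then weight the gathered features before fusion, so that poorly matched regions contribute less. Note that the similarity matrix grows quadratically with the number of positions, which is one reason such global matching is usually done at a coarse feature scale.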