Blur-aware Spatio-temporal Sparse Transformer for Video Deblurring

Huicong Zhang, Haozhe Xie, Hongxun Yao

TL;DR

BSSTNet tackles video deblurring by exploiting the observation that blur correlates with pixel displacement, and introduces blur maps to drive sparse spatio-temporal attention. It combines a non-learnable blur-map estimation with Blur-aware Bidirectional Feature Propagation (BBFP) and a Blur-aware Spatio-temporal Sparse Transformer (BSST) to extend temporal context while keeping computation manageable. Empirical results on GoPro and DVD show state-of-the-art PSNR/SSIM with substantial FLOPs reduction and faster runtimes, supported by comprehensive ablations. The method improves robustness to blur in distant frames and reduces error accumulation during bidirectional propagation, benefiting downstream video tasks that rely on sharp frames.

Abstract

Video deblurring relies on leveraging information from other frames in the video sequence to restore the blurred regions in the current frame. Mainstream approaches employ bidirectional feature propagation, spatio-temporal transformers, or a combination of both to extract information from the video sequence. However, limited memory and computational resources constrain the temporal window length of the spatio-temporal transformer, preventing the extraction of longer temporal context from the video sequence. Additionally, bidirectional feature propagation is highly sensitive to inaccurate optical flow in blurry frames, leading to error accumulation during the propagation process. To address these issues, we propose BSSTNet, the Blur-aware Spatio-temporal Sparse Transformer Network. It introduces the blur map, which converts the originally dense attention into a sparse form, enabling more extensive use of information throughout the entire video sequence. Specifically, BSSTNet (1) uses a longer temporal window in the transformer, leveraging information from more distant frames to restore the blurry pixels in the current frame, and (2) introduces bidirectional feature propagation guided by blur maps, which reduces error accumulation caused by blurry frames. The experimental results demonstrate that the proposed BSSTNet outperforms state-of-the-art methods on the GoPro and DVD datasets.
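
The idea of a blur map that turns dense attention into sparse attention can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the use of normalized optical-flow magnitude as the blur score, the window-level mean threshold, and the choice of blurry windows as queries and sharp windows as keys/values are all assumptions made for illustration.

```python
import numpy as np

def blur_map_from_flow(flow: np.ndarray) -> np.ndarray:
    """Assumed blur score: optical-flow magnitude scaled to [0, 1].

    flow has shape (H, W, 2); the result has shape (H, W).
    """
    magnitude = np.linalg.norm(flow, axis=-1)
    peak = magnitude.max()
    return magnitude / peak if peak > 0 else np.zeros_like(magnitude)

def window_is_blurry(blur_map: np.ndarray, window: int, threshold: float) -> np.ndarray:
    """Flag each non-overlapping window whose mean blur score exceeds threshold."""
    h, w = blur_map.shape
    tiles = blur_map.reshape(h // window, window, w // window, window)
    return tiles.mean(axis=(1, 3)) > threshold

def attention(query: np.ndarray, key: np.ndarray, value: np.ndarray) -> np.ndarray:
    """Plain scaled dot-product attention over the tokens it is given."""
    scores = query @ key.T / np.sqrt(query.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ value

# Synthetic example: only the right half of an 8x8 frame moves.
flow = np.zeros((8, 8, 2))
flow[:, 4:, 0] = 3.0
mask = window_is_blurry(blur_map_from_flow(flow), window=4, threshold=0.5)

# One hypothetical feature token per window, for brevity.
tokens = np.random.default_rng(0).normal(size=(mask.size, 16))
blurry = mask.ravel()
queries, keys_values = tokens[blurry], tokens[~blurry]
restored = attention(queries, keys_values, keys_values)

dense_pairs = tokens.shape[0] * tokens.shape[0]
sparse_pairs = queries.shape[0] * keys_values.shape[0]
```

In this toy setting, dense attention would score 16 query-key pairs, while the masked version scores 4. The savings in a real model depend on how much of each frame is blurry and on how tokens are selected.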

Paper Structure (13 sections, 10 equations, 6 figures, 7 tables)

Figures (6)

  • Figure 1: (a) Large motions in optical flows are highlighted in blur maps. (b) Comparison of FLOPs between the standard spatio-temporal transformer and the blur-aware spatio-temporal transformer. (c-d) Summary of the standard spatio-temporal transformer and the blur-aware spatio-temporal transformer. (e-f) Summary of the standard flow-guided feature alignment and blur-aware feature alignment. (g) In the visual comparisons on the GoPro dataset, the proposed BSSTNet restores the sharpest frame.
  • Figure 2: Overview of the proposed BSSTNet. BSSTNet consists of three major components: Blur Map Estimation, Blur-aware Bidirectional Feature Propagation (BBFP), and Blur-aware Spatio-temporal Sparse Transformer (BSST).
  • Figure 3: The details of BFA. Note that Ⓦ, Ⓒ, and $\bigoplus$ denote the "Warp", "Concatenation", and "Element-wise Add" operations, respectively.
  • Figure 4: The details of BSST. Note that Ⓟ denotes the "Window Partition" operation. "Flatten by window" indicates that query tokens are flattened for each query window, and K/V tokens are generated in a similar manner. Multi-head Self Attention is also computed on the query and K/V tokens generated for each window.
  • Figure 5: Qualitative comparison on the GoPro and DVD datasets. Note that "GT" stands for "Ground Truth". The proposed BSSTNet produces images with enhanced sharpness and more detailed visuals compared to competing methods.
  • ...and 1 more figure