Rethinking Video Deblurring with Wavelet-Aware Dynamic Transformer and Diffusion Model

Chen Rao, Guangyuan Li, Zehua Lan, Jiakai Sun, Junsheng Luan, Wei Xing, Lei Zhao, Huaizhong Lin, Jianfeng Dong, Dalong Zhang

TL;DR

This work tackles the challenge of recovering high-frequency details in video deblurring by integrating a diffusion model (DM) with a Wavelet-Aware Dynamic Transformer (WADT). The DM operates in a compact latent space and, conditioned on the blurred input, produces prior features $z' \in \mathbb{R}^{T\times 4C'}$. WADT fuses these features to restore both low- and high-frequency content in $V_{HQ}$ from $V_{blur}$. Key contributions include the wavelet-based decomposition within WADT, the Wavelet-based Bidirectional Propagation Fuse (WBPF), and a three-stage training strategy that jointly optimizes deblurring and diffusion objectives. Experiments on the GoPro, DVD, BSD, and Real-World Video datasets show state-of-the-art performance with improved texture detail and temporal consistency, and improved efficiency because the diffusion runs in latent space with only a few steps (e.g., $T=4$). The approach is practically relevant for high-fidelity restoration of real-world, blur-impaired footage.

Abstract

Current video deblurring methods have limitations in recovering high-frequency information since the regression losses are conservative with high-frequency details. Since Diffusion Models (DMs) have strong capabilities in generating high-frequency details, we consider introducing DMs into the video deblurring task. However, we found that directly applying DMs to the video deblurring task has the following problems: (1) DMs require many iteration steps to generate videos from Gaussian noise, which consumes many computational resources. (2) DMs are easily misled by the blurry artifacts in the video, resulting in irrational content and distortion of the deblurred video. To address the above issues, we propose a novel video deblurring framework VD-Diff that integrates the diffusion model into the Wavelet-Aware Dynamic Transformer (WADT). Specifically, we perform the diffusion model in a highly compact latent space to generate prior features containing high-frequency information that conforms to the ground truth distribution. We design the WADT to preserve and recover the low-frequency information in the video while utilizing the high-frequency information generated by the diffusion model. Extensive experiments show that our proposed VD-Diff outperforms SOTA methods on GoPro, DVD, BSD, and Real-World Video datasets.

Paper Structure (17 sections, 17 equations, 9 figures, 5 tables)

Figures (9)

  • Figure 1: The overall architecture of the proposed VD-Diff, which consists of a Wavelet-Aware Dynamic Transformer (WADT) and a Diffusion Model (DM). Specifically, WADT adopts the Wavelet Transform (WT) for feature separation, the WADT Layer (WADTL) for deep feature extraction, and the Wavelet-based Bidirectional Propagation Fuse (WBPF) for spatio-temporal information propagation between frames. The DM generates prior features to supplement high-frequency information for the deblurring process in WADTL. WT: Wavelet Transform. IWT: Inverse Wavelet Transform.
  • Figure 2: The illustration of Wavelet-Aware Dynamic Transformer Layer (WADTL), which consists of Wavelet-Aware Dynamic Multi-head Self-Attention (WAD-MSA) and Wavelet-Aware Dynamic Feed-Forward Network (WAD-FFN).
  • Figure 3: The structure of the Forward Process in the WBPF. The Backward Process has the same network structure as the Forward Process, but the direction of information propagation between frames is opposite.
  • Figure 4: Visual comparison on the GoPro dataset. The deblurred results of previous work still contain significant blur effects. Our method generates much clearer frames.
  • Figure 5: Visual comparison on the DVD dataset. The deblurring effect of our proposed model is significantly better.
  • ...and 4 more figures
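
The Figure 1 caption describes WADT as using a Wavelet Transform (WT) for feature separation and an Inverse Wavelet Transform (IWT) to return to the original domain. As a rough illustration of what such a separation does, the sketch below applies a single-level 2D Haar transform to one frame with NumPy. The choice of the Haar wavelet, the function names `haar_wt` and `haar_iwt`, and the operation on a raw single-channel frame (rather than on learned feature maps) are all assumptions made for illustration; the summary above does not state which wavelet or implementation VD-Diff uses.

```python
import numpy as np


def haar_wt(frame):
    """Single-level 2D Haar transform of a float (H, W) array, H and W even.

    Returns one low-frequency band and three high-frequency detail bands,
    each of shape (H/2, W/2). Hypothetical helper, not the paper's code.
    """
    a = frame[0::2, 0::2]  # top-left pixel of every 2x2 block
    b = frame[0::2, 1::2]  # top-right
    c = frame[1::2, 0::2]  # bottom-left
    d = frame[1::2, 1::2]  # bottom-right
    low = (a + b + c + d) / 2.0     # block average (scaled): low frequency
    det_h = (a - b + c - d) / 2.0   # left-minus-right differences
    det_v = (a + b - c - d) / 2.0   # top-minus-bottom differences
    det_d = (a - b - c + d) / 2.0   # diagonal differences
    return low, det_h, det_v, det_d


def haar_iwt(low, det_h, det_v, det_d):
    """Inverse of haar_wt: rebuilds the (H, W) frame from the four bands."""
    h, w = low.shape
    out = np.empty((2 * h, 2 * w), dtype=low.dtype)
    out[0::2, 0::2] = (low + det_h + det_v + det_d) / 2.0
    out[0::2, 1::2] = (low - det_h + det_v - det_d) / 2.0
    out[1::2, 0::2] = (low + det_h - det_v - det_d) / 2.0
    out[1::2, 1::2] = (low - det_h - det_v + det_d) / 2.0
    return out


# Usage: decompose a random frame and check that the transform is lossless.
rng = np.random.default_rng(0)
frame = rng.random((8, 8))
low, det_h, det_v, det_d = haar_wt(frame)
recon = haar_iwt(low, det_h, det_v, det_d)
```

Because the transform is invertible, a network can process the low-frequency band and the three detail bands separately and still reconstruct a full-resolution frame. A single level yields four half-resolution bands per channel, which may be related to the factor of four in the $4C'$ prior dimension, although the summary does not confirm this.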