
DIVD: Deblurring with Improved Video Diffusion Model

Haoyang Long, Yan Wang, Wendong Wang

TL;DR

This work tackles video deblurring by reframing it as a conditional diffusion problem and introducing two novel components: Window-based Temporal Self-Attention (WTSA), which processes multiple frames in parallel within local windows, and Multi-frame Relative Positional Encoding (MRPE), which supplies both temporal and spatial positional information. Together they enable implicit alignment and fusion of misaligned adjacent frames, yielding state-of-the-art perceptual quality while preserving detail and maintaining competitive distortion metrics. Experiments on GoPro and DVD show strong performance on the perceptual metrics LPIPS, FID, and KID, and ablations validate the contributions of WTSA and MRPE. The approach highlights the importance of perceptual evaluation in image restoration and offers a diffusion-based solution for high-fidelity video deblurring, albeit with slower inference and a PSNR gap relative to the current state of the art.
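
As a rough illustration of the two components named above, the following NumPy sketch applies self-attention inside spatial windows that span all frames, adds a per-frame embedding for temporal position, and adds a spatial relative position bias to the attention logits. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the single attention head, the tensor layout `(T, H, W, C)`, the non-overlapping windows, and the random weights are all hypothetical choices made for the example.

```python
import numpy as np

def window_partition(x, w):
    """(T, H, W, C) -> (num_windows, T*w*w, C); each window spans all T frames."""
    T, H, W, C = x.shape
    x = x.reshape(T, H // w, w, W // w, w, C)
    x = x.transpose(1, 3, 0, 2, 4, 5)            # (H/w, W/w, T, w, w, C)
    return x.reshape(-1, T * w * w, C)

def window_merge(tokens, T, H, W, w):
    """Inverse of window_partition."""
    C = tokens.shape[-1]
    x = tokens.reshape(H // w, W // w, T, w, w, C)
    x = x.transpose(2, 0, 3, 1, 4, 5)            # (T, H/w, w, W/w, w, C)
    return x.reshape(T, H, W, C)

def spatial_relative_bias(table, T, w):
    """Look up a bias for every token pair from their (dy, dx) spatial offset.

    table has shape (2w-1, 2w-1). Tokens in a window are ordered (t, i, j),
    so pairs at the same spatial offset share a bias regardless of frame.
    """
    ii, jj = np.meshgrid(np.arange(w), np.arange(w), indexing="ij")
    ii = np.tile(ii.ravel(), T)
    jj = np.tile(jj.ravel(), T)
    dy = ii[:, None] - ii[None, :] + w - 1
    dx = jj[:, None] - jj[None, :] + w - 1
    return table[dy, dx]                         # (T*w*w, T*w*w)

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def windowed_temporal_attention(x, frame_embed, Wq, Wk, Wv, bias_table, w):
    """Single-head attention over tokens of all frames inside each window."""
    T, H, W, C = x.shape
    x = x + frame_embed[:, None, None, :]        # assumed learnable per-frame encoding
    tok = window_partition(x, w)                 # (N, L, C) with L = T*w*w
    q, k, v = tok @ Wq, tok @ Wk, tok @ Wv
    logits = q @ k.transpose(0, 2, 1) / np.sqrt(C)
    attn = softmax(logits + spatial_relative_bias(bias_table, T, w))
    return window_merge(attn @ v, T, H, W, w)

# Toy usage with random features and weights (all values are placeholders).
rng = np.random.default_rng(0)
T, H, W, C, w = 3, 8, 8, 4, 4
x = rng.standard_normal((T, H, W, C))
frame_embed = 0.02 * rng.standard_normal((T, C))
Wq, Wk, Wv = (rng.standard_normal((C, C)) for _ in range(3))
bias_table = 0.02 * rng.standard_normal((2 * w - 1, 2 * w - 1))
out = windowed_temporal_attention(x, frame_embed, Wq, Wk, Wv, bias_table, w)
print(out.shape)
```

Because every window contains tokens from all frames, a pixel in one frame can attend to nearby pixels in its neighbours without an explicit motion-compensation step, which is the sense in which alignment is implicit. The attention cost grows with the window size and frame count rather than with the full image resolution.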

Abstract

Video deblurring is challenging because of the complexity of blur, which frequently results from a combination of camera shake and object motion. Many previous works on video deblurring have concentrated primarily on distortion-based metrics such as PSNR. However, these metrics often correlate weakly with human perception, and optimizing for them yields reconstructions that lack realism. Diffusion models and video diffusion models have excelled in image and video generation respectively, achieving remarkable results in image authenticity and perceptual realism. However, due to the computational complexity and the challenges inherent in adapting diffusion models, the potential of video diffusion models for video deblurring remains uncertain. To explore their viability for this task, we introduce a diffusion model designed specifically for video deblurring. In this field, leveraging the highly correlated information between adjacent frames and addressing temporal misalignment are crucial research directions. To tackle these challenges, this work introduces a number of improvements to the video diffusion model. As a result, our model outperforms existing models and achieves state-of-the-art results on a range of perceptual metrics. Our model preserves a significant amount of image detail while maintaining competitive distortion metrics. Furthermore, to the best of our knowledge, this is the first time a diffusion model has been applied to video deblurring to overcome the limitations mentioned above.

Paper Structure

This paper contains 24 sections, 4 equations, 11 figures, 5 tables.

Figures (11)

  • Figure 1: Trends in PSNR ($\uparrow$), FID ($\downarrow$), and LPIPS ($\downarrow$) as image smoothness increases. All images are sampled from the GoPro (Nah et al., 2017) test set. SA-x refers to "sample average", where x is the number of sampled images averaged; a larger x yields smoother images. Base refers to a single sample.
  • Figure 2: The texture and details in the single-sampled image more closely resemble the ground truth (GT) image. In contrast, the SA-8 image (sampled 8 times and averaged) notably lacks background details and displays overly smooth edges. Despite achieving a higher PSNR score, the SA-8 image is distinguishable to the human eye as unrealistic, reflecting its low perceptual quality.
  • Figure 3: Model Architecture. (a) The overall process of the model: concatenated noisy and blurry images are input, and clear images are obtained through $T$ iterations of denoising. (b) The structure of all blocks in the model, which incorporates joint position encoding; this encoding plays a crucial role within the blocks (more details in Fig. 4). (c) Window-based Temporal Self-Attention (WTSA) module: features segmented by the window undergo self-attention operations, aiding in the alignment and fusion of features from misaligned frames.
  • Figure 4: Architecture of Multi-frame Relative Positional Encoding (MRPE) consists of two components. Multi-frame positional encoding incorporates learnable position encodings, enabling the model to capture temporal information between frames. Relative Position Bias is utilized within the attention mechanism to obtain spatial positional information of frames.
  • Figure 5: When dealing with moving objects (such as wheels), our model can maximally restore their structure and retain the most details, rather than producing overly smooth images.
  • ...and 6 more figures
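
The SA-x effect described in Figures 1 and 2 can be reproduced in a toy setting. The sketch below is an illustration only: it assumes each diffusion sample behaves like the ground truth plus independent zero-mean noise, which is a simplification (real samples differ in plausible detail, not additive Gaussian noise), and the names `draw_sample` and `sample_average` are hypothetical. Under that assumption, averaging x samples divides the mean squared error by roughly x, so PSNR rises by about $10\log_{10} x$ dB even though the averaged result carries less sample-specific detail.

```python
import math
import random

def psnr(ref, est, peak=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length signals."""
    mse = sum((r - e) ** 2 for r, e in zip(ref, est)) / len(ref)
    return 10.0 * math.log10(peak ** 2 / mse)

def sample_average(samples):
    """Element-wise mean of several samples (the SA-x operation)."""
    n = len(samples)
    return [sum(vals) / n for vals in zip(*samples)]

random.seed(0)
gt = [random.random() for _ in range(10_000)]   # stand-in for a sharp image

def draw_sample(sigma=0.05):
    """Toy stand-in for one stochastic sample: ground truth plus noise."""
    return [g + random.gauss(0.0, sigma) for g in gt]

base = draw_sample()                                     # "Base": one sample
sa8 = sample_average([draw_sample() for _ in range(8)])  # "SA-8"

gain = psnr(gt, sa8) - psnr(gt, base)
print(f"PSNR gain from averaging 8 samples: {gain:.2f} dB")
```

The gain here comes purely from noise cancellation, which is why a higher PSNR alone does not indicate a more realistic image; perceptual metrics such as LPIPS and FID are needed to capture the loss of detail.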