VDPI: Video Deblurring with Pseudo-inverse Modeling

Zhihao Huang, Santiago Lopez-Tapia, Aggelos K. Katsaggelos

TL;DR

VDPI tackles video deblurring by combining explicit blur modeling with learning. It learns a blur operator $H$ and a pseudo-inverse $H^+$ via CNNs, then feeds $H^+ y$ along with the observed frames into a variational deep network that is conditioned on a latent variable $c$ to capture domain priors. The approach combines a CNN-based blur fitting stage, a constrained inverse simulation stage, and a VAE-augmented UNet to achieve robust, high-quality restoration across diverse datasets. Experiments on GoPro, DVDb, and REDS show state-of-the-art PSNR, SSIM, and LPIPS, with improved contour fidelity and temporal coherence across challenging blur scenarios.
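
For reference, the role of $H$ and $H^+$ can be stated with the textbook linear image-formation model. This is a generic formulation assumed here for illustration, not the paper's own equations: the noise term $n$ and the symbol $\tilde{x}$ for the pseudo-inverse estimate are notation introduced for this note.

$$
y = Hx + n, \qquad \tilde{x} = H^{+} y
$$

A Moore-Penrose pseudo-inverse of a real matrix $H$ is characterized by the four Penrose conditions:

$$
H H^{+} H = H, \qquad H^{+} H H^{+} = H^{+}, \qquad (H H^{+})^{\top} = H H^{+}, \qquad (H^{+} H)^{\top} = H^{+} H
$$

When $H$ is not invertible, $H^{+} y$ recovers only the part of $x$ that the blur did not destroy, which is why a further restoration stage is still needed after this step.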

Abstract

Video deblurring is a challenging task that aims to recover sharp sequences from blurry and noisy observations. The image-formation model plays a crucial role in traditional model-based methods, where it constrains the set of possible solutions. However, only some deep learning-based methods make use of it. Although deep-learning models achieve better results, traditional model-based methods remain widely popular due to their flexibility, and an increasing number of researchers combine the two to achieve better deblurring performance. This paper proposes introducing knowledge of the image-formation model into a deep learning network by using the pseudo-inverse of the blur. We use a deep network to fit the blur and estimate its pseudo-inverse. We then use this estimate, combined with a variational deep-learning network, to deblur the video sequence. Our experimental results demonstrate that these modifications can significantly improve the performance of deep learning models for video deblurring. Furthermore, our experiments on different datasets show notable performance improvements, indicating that the proposed method generalizes to different scenarios and cameras.
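
To make the idea of "using the pseudo-inverse of the blur" concrete, here is a minimal one-dimensional sketch. It is not the paper's method: the paper estimates the blur and its pseudo-inverse with deep networks, whereas this toy example assumes a known, hand-picked circular blur kernel and approximates the pseudo-inverse solution with classical Landweber iterations started from zero. All function and variable names (`blur_matrix`, `landweber_pinv_apply`, `x_init`, and so on) are hypothetical.

```python
def blur_matrix(n, kernel):
    """Build an n x n circular-convolution matrix H from a 1D kernel."""
    r = len(kernel) // 2
    H = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k, w in enumerate(kernel):
            H[i][(i + k - r) % n] += w
    return H


def matvec(A, v):
    """Multiply matrix A (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]


def transpose(A):
    """Return the transpose of matrix A."""
    return [list(col) for col in zip(*A)]


def landweber_pinv_apply(H, y, step, iters):
    """Approximate H^+ y with Landweber iterations started from zero.

    Each iteration does x <- x + step * H^T (y - H x). Starting from zero,
    the iterates converge to the minimum-norm least-squares solution
    when step < 2 / (largest singular value of H)^2.
    """
    Ht = transpose(H)
    x = [0.0] * len(H[0])
    for _ in range(iters):
        residual = [yi - hi for yi, hi in zip(y, matvec(H, x))]
        update = matvec(Ht, residual)
        x = [xi + step * ui for xi, ui in zip(x, update)]
    return x


# Assumed toy setup: 8 samples and a [0.25, 0.5, 0.25] circular blur.
# This H is singular (it maps the alternating +1/-1 signal to zero),
# so an ordinary inverse does not exist.
n = 8
H = blur_matrix(n, [0.25, 0.5, 0.25])
x_sharp = [0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0]
y_blur = matvec(H, x_sharp)

# Pseudo-inverse estimate of the sharp signal from the blurred one.
# The largest singular value of this H is 1, so step=1.0 is safe.
x_init = landweber_pinv_apply(H, y_blur, step=1.0, iters=2000)
```

In this toy case `x_init` reproduces the blurred observation when re-blurred, but it lacks the alternating component that the blur removed. In the paper's pipeline, the analogous estimate is passed together with the observed frames to the variational network, which is responsible for restoring the remaining detail.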

Paper Structure (11 sections, 14 equations, 15 figures, 7 tables)

Figures (15)

  • Figure 1: The proposed network framework consists of three main components: Blur Estimation, Pseudo-inverse Estimation, and Deep Variational Network.
  • Figure 2: The network that simulates the blurring kernels ${\mathbf{H}}$, consisting of NAFNet blocks, down-sample blocks, and up-sample blocks. In particular, we use NAFNet as both the encoder and the decoder. At each upsampling level, the network outputs the corresponding blur-simulation features ${\mathbf{H}}_0$, ${\mathbf{H}}_1$, ${\mathbf{H}}_2$ (with shape 32, 64, and 128). Note that the network's input is only the blurred image ${\mathbf{y}}$. Sharp images are usually available as ground truth during training, but some test sets contain only blurred images. We therefore use the ground truth only to compute the training loss; the network itself does not take the sharp image ${\mathbf{x}}$ as input. Intuitively, once the network is trained, its input can be any image, whether sharp or blurred.
  • Figure 3: BlurDictModel structure, which consists of two branches: an input branch and an ${\mathbf{H}}$ branch. The key feature extractors are two Conv3d layers with kernel size $1\times 15\times 15$. One of them outputs a feature with channel size 50, which is multiplied with the input feature ${\mathbf{H}}$ and then merged with the feature from the input branch.
  • Figure 4: The complete structure for applying the blurring kernels to the input. It combines the different dimension levels and produces three outputs, ${\mathbf{H}}{\mathbf{x}}_0$, ${\mathbf{H}}{\mathbf{x}}_1$, and ${\mathbf{H}}{\mathbf{x}}_2$. The structure can be applied to any input, not only ${\mathbf{x}}$.
  • Figure 5: The network that computes the pseudo-inverse kernels ${\mathbf{H}}^+$, which uses the same encoder and decoder structure as the blurring simulation. At each upsampling level, the network outputs the corresponding pseudo-inverse kernels ${\mathbf{H}}^+_0$, ${\mathbf{H}}^+_1$, and ${\mathbf{H}}^+_2$.
  • ...and 10 more figures