Image Deblurring by Exploring In-depth Properties of Transformer

Pengwei Liang, Junjun Jiang, Xianming Liu, Jiayi Ma

TL;DR

The paper tackles the enduring trade-off between perceptual quality and quantitative fidelity in image deblurring. It introduces two transformer-based perceptual losses that exploit pretrained MAE representations: a local MAE perceptual loss operating in Euclidean space on MAE features, and a global distribution perceptual loss based on $p$-Wasserstein distances between MAE token-feature distributions. Through extensive experiments on defocus and motion deblurring (DPDD, GoPro, HIDE) and even deraining (Rain100H), the method achieves notable perceptual gains with minimal PSNR sacrifice, outperforming VGG-based perceptual losses across several baselines. Ablation studies confirm MAE’s superiority for this task, with middle-layer features and token representations providing the best balance between fidelity and perceptual realism. The findings suggest that leveraging in-depth transformer properties offers a robust, generalizable avenue for enhancing low-level image restoration tasks.

Abstract

Image deblurring continues to achieve impressive performance with the development of generative models. Nonetheless, a persistent problem remains when one wants to improve the perceptual quality and the quantitative scores of the recovered image at the same time. In this study, drawing inspiration from research on transformer properties, we introduce pretrained transformers to address this problem. In particular, we leverage deep features extracted from a pretrained vision transformer (ViT) to encourage recovered images to be sharp without sacrificing the performance measured by quantitative metrics. The pretrained transformer can capture the global topological relations (i.e., self-similarity) of an image, and we observe that the topological relations captured for a sharp image change when blur occurs. By comparing the transformer features of the recovered image and the target image, the pretrained transformer provides high-resolution, blur-sensitive semantic information, which is critical in measuring the sharpness of the deblurred image. Building on these advantages, we present two types of novel perceptual losses to guide image deblurring. The first regards the features as vectors and computes the discrepancy between the representations extracted from the recovered image and the target image in Euclidean space. The second treats the features extracted from an image as a distribution and compares the distribution discrepancy between the recovered image and the target image. We demonstrate the effectiveness of transformer properties in improving perceptual quality without sacrificing quantitative scores (PSNR) over the most competitive models, such as Uformer, Restormer, and NAFNet, on defocus deblurring and motion deblurring tasks.
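The two loss types described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes the token features have already been extracted by some pretrained ViT and are given as arrays of shape `(num_tokens, dim)`. The function names are hypothetical. The distribution term uses a sliced Wasserstein approximation (random 1-D projections followed by sorting) as a stand-in, since this page does not state how the paper computes the $p$-Wasserstein distance.

```python
import numpy as np


def euclidean_feature_loss(feat_pred, feat_target):
    """Mean squared Euclidean distance between matching token features.

    Both inputs have shape (num_tokens, dim); token i of the recovered image
    is compared with token i of the target image.
    """
    diff = np.asarray(feat_pred, dtype=np.float64) - np.asarray(feat_target, dtype=np.float64)
    return float(np.mean(np.sum(diff ** 2, axis=1)))


def sliced_wasserstein(feat_pred, feat_target, p=2, num_projections=64, seed=0):
    """Sliced p-Wasserstein distance between two equally sized token sets.

    Assumption: this approximation replaces the exact p-Wasserstein distance.
    Each token set is treated as an empirical distribution, so the result
    does not depend on the order of the tokens.
    """
    x = np.asarray(feat_pred, dtype=np.float64)
    y = np.asarray(feat_target, dtype=np.float64)
    if x.shape != y.shape:
        raise ValueError("this sketch assumes equal token counts and dimensions")
    rng = np.random.default_rng(seed)
    # Random unit directions, one per column.
    directions = rng.normal(size=(x.shape[1], num_projections))
    directions /= np.linalg.norm(directions, axis=0, keepdims=True)
    # For equal-size 1-D samples, matching sorted values is the optimal transport plan.
    x_proj = np.sort(x @ directions, axis=0)
    y_proj = np.sort(y @ directions, axis=0)
    return float(np.mean(np.abs(x_proj - y_proj) ** p) ** (1.0 / p))


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    target = rng.normal(size=(196, 32))               # stand-in for sharp-image token features
    pred = target + 0.1 * rng.normal(size=(196, 32))  # stand-in for recovered-image features
    print(euclidean_feature_loss(pred, target), sliced_wasserstein(pred, target))
```

The first function is sensitive to where each token sits, while the second only compares the overall spread of token features, which mirrors the vector view and the distribution view contrasted in the abstract.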

Paper Structure (27 sections, 7 equations, 12 figures, 7 tables)

Figures (12)

  • Figure 1: Illustration of the effectiveness of VGG and ViT features. The first and second rows present a blurry example and a ground-truth example, respectively, together with their feature representations. Comparing the visualized features in the last column shows that image blur clearly changes the topological relations (i.e., self-similarity) of the features extracted from the clean images.
  • Figure 2: Workflow of the proposed perceptual losses.
  • Figure 3: Similarity visualization of different ViT features on the image deblurring dataset. We extract the ViT features from the same ViT architecture trained in different ways: (b) supervised ViT [dosovitskiy2020vit], (c) self-supervised DINO [caron2021emerging], and (d) MAE [MaskedAutoencoders2021]. Panels (b-d) show similarity heatmaps computed between the feature located at $\star$ and all features in the image.
  • Figure 4: Illustration of the prediction error measured by quantitative and perceptual metrics at different iterations of deblurring. To highlight the improvement in the deblurring results, we show residual maps that reflect the difference between the results of the current iteration and the previous iteration.
  • Figure 5: A visual example on the dual-pixel DPDD dataset. For convenience, we show only the left view as the blurry image. From (a) to (d), we present cropped, highlighted deblurring results to validate the effectiveness of the proposed perceptual losses.
  • ...and 7 more figures
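
The similarity heatmaps described in the Figure 3 caption compare one query feature against all features of the same image. A minimal sketch of that computation follows. It assumes cosine similarity as the measure and patch tokens stored in row-major order, neither of which is confirmed by this page. The function name and arguments are hypothetical.

```python
import numpy as np


def similarity_heatmap(tokens, grid_hw, query_rc):
    """Cosine similarity between one query token and every token.

    tokens:   (H*W, dim) array of patch-token features in row-major order.
    grid_hw:  (H, W) layout of the patch grid.
    query_rc: (row, col) of the query patch, i.e. the starred location.
    """
    h, w = grid_hw
    feats = np.asarray(tokens, dtype=np.float64)
    if feats.shape[0] != h * w:
        raise ValueError("token count must equal H * W")
    # Normalize each token so that dot products become cosine similarities.
    norms = np.linalg.norm(feats, axis=1, keepdims=True)
    unit = feats / np.maximum(norms, 1e-12)
    query = unit[query_rc[0] * w + query_rc[1]]
    return (unit @ query).reshape(h, w)
```

Computing this map for a blurry image and for its sharp counterpart at the same query location gives a direct way to inspect how blur alters the self-similarity structure.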