Image Deblurring by Exploring In-depth Properties of Transformer
Pengwei Liang, Junjun Jiang, Xianming Liu, Jiayi Ma
TL;DR
The paper tackles the enduring trade-off between perceptual quality and quantitative fidelity in image deblurring. It introduces two transformer-based perceptual losses that exploit pretrained masked autoencoder (MAE) representations: a local MAE perceptual loss operating in Euclidean space on MAE features, and a global distribution perceptual loss based on $p$-Wasserstein distances between MAE token-feature distributions. In extensive experiments on defocus and motion deblurring (DPDD, GoPro, HIDE), and also on deraining (Rain100H), the method achieves notable perceptual gains with minimal PSNR sacrifice, outperforming VGG-based perceptual losses across several baselines. Ablation studies confirm MAE’s superiority for this task, with middle-layer features and token representations providing the best balance between fidelity and perceptual realism. The findings suggest that leveraging in-depth transformer properties offers a robust, generalizable avenue for enhancing low-level image restoration tasks.
Abstract
Image deblurring continues to achieve impressive performance with the development of generative models. Nonetheless, a persistent problem remains: it is difficult to improve the perceptual quality and the quantitative scores of the recovered image at the same time. In this study, drawing inspiration from research on transformer properties, we introduce pretrained transformers to address this problem. In particular, we leverage deep features extracted from a pretrained vision transformer (ViT) to encourage recovered images to be sharp without sacrificing the performance measured by quantitative metrics. The pretrained transformer can capture the global topological relations (i.e., self-similarity) of an image, and we observe that the topological relations captured for a sharp image change when blur occurs. By comparing the transformer features of the recovered image and the target image, the pretrained transformer provides high-resolution, blur-sensitive semantic information, which is critical for measuring the sharpness of the deblurred image. Building on these advantages, we present two types of novel perceptual losses to guide image deblurring. The first regards the features as vectors and computes the discrepancy between the representations extracted from the recovered image and the target image in Euclidean space. The second treats the features extracted from an image as a distribution and compares the distributions of the recovered image and the target image. We demonstrate the effectiveness of transformer properties in improving perceptual quality without sacrificing quantitative scores (PSNR) over the most competitive models, such as Uformer, Restormer, and NAFNet, on defocus deblurring and motion deblurring tasks.
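To make the difference between the two loss types concrete, the following is a minimal NumPy sketch and not the authors' implementation. It assumes the transformer features of each image are already available as an array of shape (num_tokens, feature_dim). The function names `local_feature_loss` and `sliced_wasserstein_loss` are hypothetical. For the distribution-based loss, the sketch swaps in a sliced $p$-Wasserstein approximation (random one-dimensional projections followed by sorting), which is an assumption made for brevity. It also assumes both images yield the same number of tokens.

```python
import numpy as np


def local_feature_loss(feat_pred, feat_target):
    """Vector view: mean squared Euclidean distance between the features
    of corresponding tokens. Inputs have shape (num_tokens, feature_dim)."""
    diff = feat_pred - feat_target
    return float(np.mean(np.sum(diff ** 2, axis=-1)))


def sliced_wasserstein_loss(feat_pred, feat_target, p=2, n_projections=64, seed=0):
    """Distribution view: treat each image's token features as an empirical
    distribution and compare the two with a sliced p-Wasserstein distance.

    NOTE: the sliced approximation is an assumption of this sketch, chosen
    because the 1-D Wasserstein distance between equally sized samples
    reduces to comparing sorted values.
    """
    rng = np.random.default_rng(seed)
    dim = feat_pred.shape[-1]
    # Random unit directions, one per column.
    directions = rng.normal(size=(dim, n_projections))
    directions /= np.linalg.norm(directions, axis=0, keepdims=True)
    # Project the tokens onto each direction and sort along the token axis.
    proj_pred = np.sort(feat_pred @ directions, axis=0)
    proj_target = np.sort(feat_target @ directions, axis=0)
    # Average |difference|^p over tokens and projections, then take the p-th root.
    return float(np.mean(np.abs(proj_pred - proj_target) ** p) ** (1.0 / p))
```

One way to see the design difference: if the token features of one image are merely shuffled, the vector-based loss becomes positive because it compares tokens position by position, whereas the distribution-based loss stays (numerically) at zero because it ignores token order and compares only the overall feature distribution.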
