From Attention to Frequency: Integration of Vision Transformer and FFT-ReLU for Enhanced Image Deblurring

Syed Mumtahin Mahmud, Mahdi Mohd Hossain Noki, Prothito Shovon Majumder, Abdul Mohaimen Al Radi, Md. Haider Ali, Md. Mosaddek Khan

TL;DR

This paper tackles high-resolution image deblurring by introducing a dual-domain architecture that unifies spatial attention via a Vision Transformer with a frequency-domain FFT-ReLU sparsity module. The method alternates between a Transformer-based pre-processing stage that reduces blur kernel ambiguity and an FFT-based deconvolution stage that enforces sparsity and suppresses artifacts, incorporating TV/L0 priors for edge preservation. Extensive experiments on real-world benchmarks show superior PSNR/SSIM and perceptual quality, corroborated by a human visual preference study that favors the proposed outputs. The approach demonstrates practical efficiency, with favorable runtime and memory characteristics, and establishes a generalizable paradigm for real-world image restoration that can accommodate lightweight backbones in the future.
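
To make the edge-preserving priors mentioned above concrete, the following is a minimal NumPy sketch of an anisotropic total variation (TV) penalty and an L0 gradient count. The function names and the tolerance `tol` are illustrative assumptions, not taken from the paper, and this summary does not specify how the paper formulates or weights these priors.

```python
import numpy as np


def gradients(image: np.ndarray):
    """Forward differences of a 2D grayscale image along x (columns) and y (rows)."""
    dx = np.diff(image, axis=1)
    dy = np.diff(image, axis=0)
    return dx, dy


def total_variation(image: np.ndarray) -> float:
    """Anisotropic TV: sum of absolute horizontal and vertical differences."""
    dx, dy = gradients(image)
    return float(np.abs(dx).sum() + np.abs(dy).sum())


def l0_gradient(image: np.ndarray, tol: float = 1e-6) -> int:
    """L0 gradient 'norm': number of differences whose magnitude exceeds tol.

    tol is a hypothetical threshold used here to ignore floating-point noise.
    """
    dx, dy = gradients(image)
    return int(np.count_nonzero(np.abs(dx) > tol) + np.count_nonzero(np.abs(dy) > tol))


# A sharp vertical step edge and a smooth ramp of the same overall height.
sharp = np.tile(np.array([0.0, 0.0, 1.0, 1.0]), (4, 1))
ramp = np.tile(np.linspace(0.0, 1.0, 4), (4, 1))

print(total_variation(sharp), l0_gradient(sharp))
print(total_variation(ramp), l0_gradient(ramp))
```

In this toy case the step edge and the ramp have essentially the same TV (about 4), while the L0 count is 4 for the step and 12 for the ramp. This illustrates why an L0 gradient prior favors sharp edges more strongly than TV alone.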

Abstract

Image deblurring is vital in computer vision, aiming to recover sharp images from blurry ones caused by motion or camera shake. While deep learning approaches such as CNNs and Vision Transformers (ViTs) have advanced this field, they often struggle with complex or high-resolution blur and computational demands. We propose a new dual-domain architecture that unifies Vision Transformers with a frequency-domain FFT-ReLU module, explicitly bridging spatial attention modeling and frequency sparsity. In this structure, the ViT backbone captures local and global dependencies, while the FFT-ReLU component enforces frequency-domain sparsity to suppress blur-related artifacts and preserve fine details. Extensive experiments on benchmark datasets demonstrate that this architecture achieves superior PSNR, SSIM, and perceptual quality compared to state-of-the-art models. Quantitative metrics, qualitative comparisons, and human preference evaluations all confirm its effectiveness, establishing a practical and generalizable paradigm for real-world image restoration.
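
The frequency-domain sparsity idea can be illustrated with a short NumPy sketch. It assumes that "FFT-ReLU" means applying ReLU separately to the real and imaginary parts of the 2D Fourier spectrum; this summary does not give the paper's exact definition, so the function names, the real-part projection after the inverse FFT, and the L1 sparsity measure below are all hypothetical choices made for illustration.

```python
import numpy as np


def fft_relu_spectrum(image: np.ndarray) -> np.ndarray:
    """Rectified spectrum of a 2D grayscale image.

    Assumption: ReLU is applied independently to the real and imaginary
    parts of the 2D FFT, zeroing every negative component.
    """
    spectrum = np.fft.fft2(image)
    real = np.maximum(spectrum.real, 0.0)
    imag = np.maximum(spectrum.imag, 0.0)
    return real + 1j * imag


def fft_relu(image: np.ndarray) -> np.ndarray:
    """Map the rectified spectrum back to the spatial domain.

    The rectified spectrum is generally not Hermitian-symmetric, so the
    inverse FFT is complex; keeping only the real part is a simplification
    chosen for this sketch.
    """
    return np.fft.ifft2(fft_relu_spectrum(image)).real


def spectral_l1(image: np.ndarray) -> float:
    """L1 norm of the rectified spectrum, one possible sparsity measure."""
    return float(np.abs(fft_relu_spectrum(image)).sum())


flat = np.ones((4, 4))
restored = fft_relu(flat)       # only the positive DC term survives
suppressed = fft_relu(-flat)    # the negative DC term is zeroed out
```

A constant positive image passes through unchanged because its spectrum consists of a single positive DC coefficient, whereas its negation is mapped to zeros. In a learned model, an operation of this kind would typically sit between trainable layers rather than be applied to raw pixels.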

Paper Structure

This paper contains 17 sections, 8 figures, 2 tables, and 3 algorithms.

Figures (8)

  • Figure 1: (a) uses a Vision Transformer to extract features from blurred images and reduce the blur kernel for further processing; (b) applies FFT-based blind and non-blind deconvolution, utilizing ReLU sparsity, to restore a sharp image.
  • Figure 2: Qualitative image deblurring comparison between state-of-the-art models and our method on the GoPro dataset. Best viewed on a high-definition monitor when zoomed in.
  • Figure 3: Qualitative image deblurring comparison, with the image captured in daylight, between state-of-the-art models and our method on the RealBlur dataset. Best viewed on a high-definition monitor when zoomed in.
  • Figure 4: Qualitative image deblurring comparison, with the image captured at night, between state-of-the-art models and our method on the RealBlur dataset. Best viewed on a high-definition monitor when zoomed in.
  • Figure 5: Qualitative image deblurring comparison, with the image captured at night, between state-of-the-art models and our method on the RealBlur dataset. Best viewed on a high-definition monitor when zoomed in.
  • ...and 3 more figures