LoFormer: Local Frequency Transformer for Image Deblurring

Xintian Mao; Jiansheng Wang; Xingran Xie; Qingli Li; Yan Wang

TL;DR

LoFormer introduces a frequency-domain transformer for image deblurring that models coarse structures and fine details jointly without the high cost of global self-attention (SA). Its core building block, the Local Frequency Transformer block (LoFT-block), combines DCT-LN (LayerNorm applied after the DCT), Freq-LC (local channel-wise SA in the frequency domain) and MGate (an MLP gating mechanism) to perform windowed, frequency-wise self-attention and gating, capturing long-range dependencies efficiently while preserving textures. The paper shows that global channel-wise SA in the spatial domain (Spa-GC) is equivalent to its frequency-domain counterpart (Freq-GC), and analyzes Freq-LC from both spatial and frequency viewpoints. Empirically, LoFormer achieves state-of-the-art PSNR on GoPro (34.09 dB at ~126G FLOPs) and strong results on RealBlur and REDS, with improved detail recovery and favorable efficiency. Overall, the method offers a principled, frequency-domain alternative to spatial attention for high-quality, efficient image deblurring.

Abstract

Due to the computational complexity of self-attention (SA), prevalent techniques for image deblurring often resort to either adopting localized SA or employing coarse-grained global SA methods, both of which exhibit drawbacks such as compromising global modeling or lacking fine-grained correlation. In order to address this issue by effectively modeling long-range dependencies without sacrificing fine-grained details, we introduce a novel approach termed Local Frequency Transformer (LoFormer). Within each unit of LoFormer, we incorporate a Local Channel-wise SA in the frequency domain (Freq-LC) to simultaneously capture cross-covariance within low- and high-frequency local windows. These operations offer the advantage of (1) ensuring equitable learning opportunities for both coarse-grained structures and fine-grained details, and (2) exploring a broader range of representational properties compared to coarse-grained global SA methods. Additionally, we introduce an MLP Gating mechanism complementary to Freq-LC, which serves to filter out irrelevant features while enhancing global learning capabilities. Our experiments demonstrate that LoFormer significantly improves performance in the image deblurring task, achieving a PSNR of 34.09 dB on the GoPro dataset with 126G FLOPs. https://github.com/DeepMed-Lab-ECNU/Single-Image-Deblur
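
The Freq-LC idea described in the abstract (channel-wise SA computed inside local windows of the frequency representation) can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes an orthonormal DCT-II, non-overlapping square windows over the DCT coefficient map, identity query/key/value projections, and a fixed softmax temperature. The function names (`dct_matrix`, `channel_attention`, `freq_lc`) and the window size of 8 are hypothetical and do not come from the paper or its repository.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix D of shape (n, n), so that D @ D.T == I."""
    k = np.arange(n)[:, None]   # frequency index
    i = np.arange(n)[None, :]   # sample index
    d = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    d[0, :] /= np.sqrt(2.0)     # rescale the DC row for orthonormality
    return d

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def channel_attention(q, k, v):
    """Channel-wise SA: q, k, v are (C, N); the attention map is (C, C)."""
    # Assumption: fixed temperature 1/sqrt(N) instead of a learned one.
    attn = softmax(q @ k.T / np.sqrt(q.shape[1]), axis=-1)
    return attn @ v

def freq_lc(x, window=8):
    """x: (C, H, W) feature map; H and W must be multiples of `window`."""
    c, h, w = x.shape
    dh, dw = dct_matrix(h), dct_matrix(w)
    x_hat = dh @ x @ dw.T                      # 2D DCT of every channel
    out = np.empty_like(x_hat)
    for r in range(0, h, window):              # loop over frequency windows
        for s in range(0, w, window):
            tok = x_hat[:, r:r + window, s:s + window].reshape(c, -1)
            # Assumption: identity projections, so q = k = v = tok.
            res = channel_attention(tok, tok, tok)
            out[:, r:r + window, s:s + window] = res.reshape(c, window, window)
    return dh.T @ out @ dw                     # inverse 2D DCT
```

For a feature map of shape (4, 16, 16) and `window=8`, the sketch attends separately within four frequency windows, so the low-frequency window (top-left) and the high-frequency windows each get their own 4 x 4 channel attention map.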

Paper Structure (39 sections, 2 theorems, 8 equations, 17 figures, 8 tables)

Introduction
Related Works
Deep Image Deblurring
Low-level Vision Transformers
Frequency Domain Applications
Method
Main Backbone
Local Frequency Transformer Block
DCT-LN
Freq-LC
MGate
Complexity analysis
Understanding SA in the Frequency Domain
Spa-GC is equivalent to Freq-GC
Proof for Proposition 1
...and 24 more sections

Key Result

Proposition 1

Spa-GC: $\textbf{O}=\text{Attention}(\textbf{Q}, \textbf{K}, \textbf{V})$ and Freq-GC: $\hat{\textbf{O}}_{f}=\text{IDCT}(\hat{\textbf{O}})=\text{IDCT}(\text{Attention}(\hat{\textbf{Q}}, \hat{\textbf{K}}, \hat{\textbf{V}}))$ are identical, without considering depth-wise convolutions and DCT-LN, where querie
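
Proposition 1 can be checked numerically. The sketch below is illustrative only and rests on assumptions not stated in the excerpt above: the DCT is an orthonormal DCT-II applied over the flattened spatial axis, attention is channel-wise (a C x C map computed as softmax of Q K^T), and the temperature is fixed. Under those assumptions the DCT matrix D satisfies D^T D = I, so Q̂ K̂^T = Q D^T D K^T = Q K^T, and the inverse DCT undoes the transform on V. All names are hypothetical.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix D of shape (n, n), so that D @ D.T == I."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    d = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    d[0, :] /= np.sqrt(2.0)
    return d

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Channel-wise attention on (C, N) inputs; assumed fixed temperature."""
    return softmax(q @ k.T / np.sqrt(q.shape[1]), axis=-1) @ v

rng = np.random.default_rng(0)
c, h, w = 4, 8, 8
q, k, v = (rng.standard_normal((c, h * w)) for _ in range(3))

# 2D DCT acting on row-major flattened (h, w) maps: D = D_h kron D_w.
D = np.kron(dct_matrix(h), dct_matrix(w))

# Spa-GC: attention computed directly on spatial tokens.
o_spa = attention(q, k, v)

# Freq-GC: DCT of Q, K, V, attention in the frequency domain, then IDCT.
q_hat, k_hat, v_hat = q @ D.T, k @ D.T, v @ D.T
o_freq = attention(q_hat, k_hat, v_hat) @ D

max_err = np.abs(o_spa - o_freq).max()
```

Here `max_err` should sit at the level of floating-point round-off. The equivalence relies on the attention being global over the spatial axis, which is why restricting attention to local frequency windows (Freq-LC) yields a genuinely different operator.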

Figures (17)

Figure 1: Different architectures of global feature learning. (a) MLP method in MAXIM Tu2022maxim; (b) Window Self-Attention in Uformer Wang2022uformer; (c) Strip Self-Attention in Stripformer Tsai2022Stripformer; (d) Global Channel Self-Attention in Restormer Zamir2021restormer; (e) Local Frequency Self-Attention in LoFormer. The vision tokens in spatial domain are converted to the DCT coefficients (frequency tokens) of different DCT basis images.
Figure 2: PSNR vs. FLOPs on the GoPro and HIDE datasets. Our method performs much better than other methods, especially the Transformer-based methods highlighted in orange.
Figure 3: The visualization of various generated kernels and their impact on the sharpness of images within both the spatial and frequency domains. Specifically, the first row of the visualization pertains to the degraded image, while the second row illustrates the generated kernel. Furthermore, the odd columns represent the spatial domain, while the even columns depict the frequency domain.
Figure 4: Architecture of LoFormer. The main backbone of LoFormer is a UNet Ronneberger2015unet model built in Restormer Zamir2021restormer. The basic building block of LoFormer is the Local Frequency Transformer block (LoFT-block), which consists of a Local Frequency Network (LoFN) module and a Feed-Forward Network (FFN) module. The core components of LoFN are DCT-LN, Freq-LC and MGate on frequency windows that perform global context aggregation.
Figure 5: LayerNorm after DCT.
...and 12 more figures

Theorems & Definitions (2)

Proposition 1
Proposition 2
