Exploring Frequency-Inspired Optimization in Transformer for Efficient Single Image Super-Resolution

Ao Li, Le Zhang, Yun Liu, Ce Zhu

TL;DR

This work investigates how frequency information affects CNN- and transformer-based single image super-resolution (SISR), finding that transformers capture low-frequency content well but struggle to reconstruct high-frequency details. It introduces CRAFT, a cross-refinement adaptive feature modulation transformer that combines a High-Frequency Enhancement Residual Block (HFERB), a Shift Rectangle Window Attention Block (SRWAB), and a Hybrid Fusion Block (HFB) to fuse high-frequency priors with global representations. To enable practical deployment, the authors propose a frequency-guided post-training quantization (PTQ) strategy with adaptive dual clipping and boundary refinement, and extend it as a general quantization method for transformer-based SR models, achieving notable performance gains at 4-bit quantization. Results on DIV2K and standard benchmarks show state-of-the-art performance in both full-precision and quantized regimes, with reduced parameter counts and improved efficiency, supporting the universality and effectiveness of the frequency-guided PTQ approach.

Abstract

Transformer-based methods have exhibited remarkable potential in single image super-resolution (SISR) by effectively extracting long-range dependencies. However, most of the current research in this area has prioritized the design of transformer blocks to capture global information, while overlooking the importance of incorporating high-frequency priors, which we believe could be beneficial. In our study, we conducted a series of experiments and found that transformer structures are more adept at capturing low-frequency information, but have limited capacity in constructing high-frequency representations when compared to their convolutional counterparts. Our proposed solution, the cross-refinement adaptive feature modulation transformer (CRAFT), integrates the strengths of both convolutional and transformer structures. It comprises three key components: the high-frequency enhancement residual block (HFERB) for extracting high-frequency information, the shift rectangle window attention block (SRWAB) for capturing global information, and the hybrid fusion block (HFB) for refining the global representation. To tackle the inherent intricacies of transformer structures, we introduce a frequency-guided post-training quantization (PTQ) method aimed at enhancing CRAFT's efficiency. The method incorporates adaptive dual clipping and boundary refinement. To further amplify the versatility of our proposed approach, we extend our PTQ strategy to function as a general quantization method for transformer-based SISR techniques. Our experimental findings showcase CRAFT's superiority over current state-of-the-art methods, both in full-precision and quantization scenarios. These results underscore the efficacy and universality of our PTQ strategy. The source code is available at: https://github.com/AVC2-UESTC/Frequency-Inspired-Optimization-for-EfficientSR.git.
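
The abstract names adaptive dual clipping but does not spell out the procedure here, so the following is only a minimal sketch of the general idea behind clipping both ends of a range independently before low-bit quantization. It uses a plain uniform quantizer and a brute-force grid search that minimizes mean squared error; it is a stand-in, not the authors' frequency-guided method. The names `quantize_dual_clip`, `mse`, and `search_clip_bounds`, the grid-search strategy, and the sample values are all hypothetical.

```python
def quantize_dual_clip(values, lower, upper, bits=4):
    """Clip to [lower, upper], quantize uniformly, and return dequantized floats."""
    if upper <= lower:
        raise ValueError("upper must be greater than lower")
    levels = 2 ** bits - 1                 # 15 steps for 4-bit codes
    scale = (upper - lower) / levels
    out = []
    for v in values:
        clipped = min(max(v, lower), upper)
        code = round((clipped - lower) / scale)   # integer code in [0, levels]
        out.append(lower + code * scale)
    return out


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def search_clip_bounds(values, bits=4, steps=20):
    """Grid-search the lower and upper bounds independently (assumed strategy).

    Each bound is moved from the observed extreme toward the mean; the pair
    with the lowest reconstruction error wins. The min/max pair is always
    among the candidates, so the result is never worse than min-max clipping.
    """
    lo0, hi0 = min(values), max(values)
    center = sum(values) / len(values)
    best = None
    for i in range(steps):
        lower = lo0 + (i / steps) * (center - lo0)
        for j in range(steps):
            upper = hi0 - (j / steps) * (hi0 - center)
            err = mse(values, quantize_dual_clip(values, lower, upper, bits))
            if best is None or err < best[2]:
                best = (lower, upper, err)
    return best


# Hypothetical asymmetric activations with a single large positive outlier.
activations = [-0.2, -0.1, 0.0, 0.1, 0.2, 0.3, 0.4, 6.0]
lower, upper, err = search_clip_bounds(activations)
print(f"lower={lower:.3f} upper={upper:.3f} mse={err:.5f}")
```

Searching the two bounds separately matters when the distribution is asymmetric, because a single symmetric threshold would spend quantization levels on a range that one side never uses.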

Paper Structure

This paper contains 36 sections, 21 equations, 14 figures, 12 tables, and 4 algorithms.

Figures (14)

  • Figure 1: Influence of high-frequency information on the performance of CNN and transformer architectures. Dashed and solid lines correspond to CNN and transformer methods, respectively. (a) With an increase in the high-frequency drop ratio, transformer models exhibit a smaller change in PSNR compared to CNN, suggesting their superiority in capturing low-frequency information. (b) As the high-frequency drop ratio increases, transformer models show a more pronounced change in PSNR compared to CNN, indicating their limited ability to reconstruct high-frequency information from low-frequency.
  • Figure 2: Effect of frequency dropping in the image domain using a mean filter.
  • Figure 3: Framework of CRAFT. HFERB extracts the high-frequency information from the input features, SRWAB captures the long-range dependencies of the input features, and HFB integrates the outputs of HFERB and SRWAB to cross-refine the global features. The reconstruction module employs a $3\times3$ convolutional layer to refine the features, and a shuffle layer (Shi et al., 2016) is used to obtain the final SR output. Best viewed in color.
  • Figure 4: Illustration of asymmetric and high-dynamic phenomenon.
  • Figure 5: Demonstrating the impact of quantization on frequency and image representation. Here, we showcase the output of HFERB with 4-bit quantization to highlight the loss of high-frequency components.
  • ...and 9 more figures
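
The caption of Figure 2 mentions dropping frequencies in the image domain with a mean filter. A minimal sketch of one way to do this is shown below, assuming the mean-filtered image is taken as the low-frequency part, the residual as the high-frequency part, and the drop ratio as the fraction of that residual removed. These definitions, the 3x3 kernel, the edge replication, and the function names `mean_filter` and `drop_high_frequency` are assumptions for illustration; the paper's exact procedure is not given in this section.

```python
def mean_filter(img, k=3):
    """Box-blur a 2D image (list of rows of floats) with edge replication."""
    h, w = len(img), len(img[0])
    r = k // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = 0.0
            for dy in range(-r, r + 1):
                for dx in range(-r, r + 1):
                    yy = min(max(y + dy, 0), h - 1)   # clamp to image borders
                    xx = min(max(x + dx, 0), w - 1)
                    total += img[yy][xx]
            out[y][x] = total / (k * k)
    return out


def drop_high_frequency(img, ratio, k=3):
    """Remove a fraction `ratio` of the high-frequency residual (img - blur)."""
    low = mean_filter(img, k)
    return [
        [l + (1.0 - ratio) * (p - l) for p, l in zip(row, low_row)]
        for row, low_row in zip(img, low)
    ]


# A checkerboard is almost pure high-frequency content, so it flattens quickly.
checker = [[float((x + y) % 2) for x in range(4)] for y in range(4)]
for row in drop_high_frequency(checker, ratio=0.5):
    print([round(v, 3) for v in row])
```

With `ratio=0` the image is returned unchanged (up to floating-point error), and with `ratio=1` only the mean-filtered, low-frequency image remains.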