ML-CrAIST: Multi-scale Low-high Frequency Information-based Cross black Attention with Image Super-resolving Transformer
Alik Pramanick, Utsav Bheda, Arijit Sur
TL;DR
This work introduces ML-CrAIST, a transformer-based single-image super-resolution architecture that jointly exploits multi-scale low/high-frequency information through 2D Discrete Wavelet Transform and cross-frequency cross-attention. Core components include the spatial-channel attention-based transformer block (SCATB), the low-high frequency interaction block (LHFIB) with an attention-based fusion block (AFB), and cross attention blocks (CAB) for cross-scale and cross-frequency message passing. Empirical results on five standard SR benchmarks show state-of-the-art PSNR/SSIM and favorable perceptual metrics, with notable gains such as $+0.20$ dB on Manga109 ×3 and competitive FLOPs, including a lighter variant (Ours-Li) with ~1.5× fewer FLOPs. The method also yields practical benefits in downstream tasks like keypoint detection and edge detection, validating its broader applicability in image restoration and analysis.
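To illustrate the low/high-frequency split that a 2D Discrete Wavelet Transform provides, here is a minimal single-level sketch in NumPy. It assumes the Haar wavelet, which this summary does not name as the paper's choice. The function names `haar_dwt2` and `haar_idwt2` are hypothetical, and the LH/HL labels follow one of several naming conventions in use. A multi-scale decomposition can be obtained by applying the same step again to the LL sub-band.

```python
import numpy as np

def haar_dwt2(x):
    """Single-level 2D Haar DWT of an (H, W) array with even H and W.

    Returns four (H/2, W/2) sub-bands: LL (low-frequency approximation)
    and LH, HL, HH (high-frequency details). Haar is an assumption here.
    """
    a = x[0::2, 0::2]  # top-left pixel of each 2x2 block
    b = x[0::2, 1::2]  # top-right
    c = x[1::2, 0::2]  # bottom-left
    d = x[1::2, 1::2]  # bottom-right
    ll = (a + b + c + d) / 2.0  # local average: low-frequency content
    lh = (a + b - c - d) / 2.0  # difference between the two rows
    hl = (a - b + c - d) / 2.0  # difference between the two columns
    hh = (a - b - c + d) / 2.0  # diagonal difference
    return ll, lh, hl, hh

def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2: rebuilds the (2h, 2w) array from the sub-bands."""
    h, w = ll.shape
    x = np.empty((2 * h, 2 * w), dtype=ll.dtype)
    x[0::2, 0::2] = (ll + lh + hl + hh) / 2.0
    x[0::2, 1::2] = (ll + lh - hl - hh) / 2.0
    x[1::2, 0::2] = (ll - lh + hl - hh) / 2.0
    x[1::2, 1::2] = (ll - lh - hl + hh) / 2.0
    return x

image = np.arange(64, dtype=float).reshape(8, 8)
ll, lh, hl, hh = haar_dwt2(image)
ll2, lh2, hl2, hh2 = haar_dwt2(ll)  # second scale, computed from LL
restored = haar_idwt2(ll, lh, hl, hh)
```

With this normalization the transform is orthonormal, so the sub-bands together preserve the energy of the input and the inverse reconstructs it exactly (up to floating-point error).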
Abstract
Recently, transformers have attracted significant interest in single-image super-resolution, demonstrating substantial gains in performance. Current models depend heavily on the network's extensive ability to extract high-level semantic details from images while overlooking the effective utilization of multi-scale image details and intermediate information within the network. Furthermore, it has been observed that high-frequency areas in images are considerably harder to super-resolve than low-frequency areas. This work proposes a transformer-based super-resolution architecture called ML-CrAIST that addresses this gap by utilizing low-high frequency information at multiple scales. Unlike most previous work, which uses either spatial or channel self-attention alone, we apply both, concurrently modeling pixel interactions along the spatial and channel dimensions and exploiting the inherent correlations across the spatial and channel axes. Further, we devise a cross-attention block for super-resolution, which explores the correlations between low- and high-frequency information. Quantitative and qualitative assessments indicate that our proposed ML-CrAIST surpasses state-of-the-art super-resolution methods (e.g., a 0.15 dB gain on Manga109 at $\times$4). Code is available at: https://github.com/Alik033/ML-CrAIST.
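The three attention patterns mentioned in the abstract can be contrasted with a minimal NumPy sketch. This is a generic illustration under stated assumptions, not the paper's SCATB or CAB: it uses a single head, no learned projections in the self-attention functions, and random projection weights in the cross-attention. All function and variable names are hypothetical. Features are stored as `(N, C)` arrays, with `N` spatial tokens and `C` channels.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def spatial_self_attention(x):
    # Attention map is (N, N): every token attends to every other token.
    attn = softmax(x @ x.T / np.sqrt(x.shape[1]))
    return attn @ x

def channel_self_attention(x):
    # Attention map is (C, C): every channel attends to every other channel.
    attn = softmax(x.T @ x / np.sqrt(x.shape[0]))
    return x @ attn.T

def cross_attention(q_src, kv_src, wq, wk, wv):
    # Queries come from one feature set, keys and values from another.
    q, k, v = q_src @ wq, kv_src @ wk, kv_src @ wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[1]))  # (N_q, N_kv)
    return attn @ v, attn

rng = np.random.default_rng(0)
N, C = 16, 8
low = rng.standard_normal((N, C))   # stand-in for low-frequency tokens
high = rng.standard_normal((N, C))  # stand-in for high-frequency tokens
wq, wk, wv = (0.1 * rng.standard_normal((C, C)) for _ in range(3))

spatial_out = spatial_self_attention(low)
channel_out = channel_self_attention(low)
cross_out, cross_map = cross_attention(low, high, wq, wk, wv)
```

Here the low-frequency tokens act as queries and the high-frequency tokens supply keys and values. Which branch plays which role in the actual cross-attention block is not stated in this abstract, so that assignment is an assumption of the sketch.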
