C2D-ISR: Optimizing Attention-based Image Super-resolution from Continuous to Discrete Scales
Yuxuan Jiang, Chengxi Zeng, Siyue Teng, Fan Zhang, Xiaoqing Zhu, Joel Sole, David Bull
TL;DR
C2D-ISR addresses the trade-off between performance and complexity in attention-based single image super-resolution (SISR) by learning inter-scale correlations through continuous-scale pre-training and then deploying an efficient discrete-scale up-sampler. The approach introduces the Hierarchical Encoding Transformer (HiET), built on a modified U-Net architecture, and a two-stage training pipeline: (i) continuous-scale pre-training, in which an implicit image function (HIIF-L) models arbitrary scales, and (ii) discrete-scale fine-tuning, in which a conventional up-sampler handles fixed scales. Applied to SwinIR-L, SRFormer-L, and MambaIRv2-L, C2D-ISR achieves PSNR gains of up to $0.2$ dB and reduces FLOPs by up to $11\%$ compared to HiT across multiple datasets and SR scales, while maintaining or improving SSIM. The framework enables faster inference and better multi-scale feature fusion; the source code is to be released.
Abstract
In recent years, attention mechanisms have been exploited in single image super-resolution (SISR), achieving impressive reconstruction results. However, these advancements are still limited by their reliance on simple training strategies and on network architectures designed for discrete up-sampling scales, which hinder the model's ability to capture information across multiple scales. To address these limitations, we propose a novel framework, C2D-ISR, for optimizing attention-based image super-resolution models in terms of both performance and complexity. Our approach is based on a two-stage training methodology and a hierarchical encoding mechanism. The new training methodology applies continuous-scale training to discrete-scale models, enabling them to learn inter-scale correlations and multi-scale feature representations. In addition, we generalize the hierarchical encoding mechanism to existing attention-based network structures, which yields improved spatial feature fusion, cross-scale information aggregation and, more importantly, much faster inference. We have evaluated the C2D-ISR framework on three efficient attention-based backbones, SwinIR-L, SRFormer-L and MambaIRv2-L, and demonstrated significant improvements over the existing optimization framework HiT, in terms of both super-resolution performance (up to 0.2 dB) and computational complexity reduction (up to 11%). The source code will be made publicly available at www.github.com.
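As an illustration of the two-stage idea only, the sketch below shows how the up-sampling factor seen during training might differ between the two stages: drawn from a continuous range during pre-training, and held at one fixed value during fine-tuning. The function names, the range of 1 to 4, the fixed scale of 4 and the 192-pixel patch size are all hypothetical choices made for this example; the abstract does not specify them, and this is not the authors' training code.

```python
import random


def sample_scale(stage, rng, discrete_scale=4, scale_range=(1.0, 4.0)):
    """Return the up-sampling factor used for one training iteration.

    Hypothetical helper: the range and the fixed scale are assumptions,
    not values taken from the paper.
    """
    if stage == "continuous":
        # Stage 1: any real-valued scale in the range may be drawn,
        # so the model is exposed to many scales during pre-training.
        return rng.uniform(*scale_range)
    if stage == "discrete":
        # Stage 2: a single fixed scale, matching the deployed up-sampler.
        return float(discrete_scale)
    raise ValueError(f"unknown stage: {stage!r}")


def lr_size(hr_size, scale):
    """Side length of the low-resolution patch for a given HR patch size."""
    return max(1, round(hr_size / scale))


rng = random.Random(0)  # seeded so the sketch is reproducible
pretrain_scales = [sample_scale("continuous", rng) for _ in range(5)]
finetune_scales = [sample_scale("discrete", rng) for _ in range(5)]

# Low-resolution patch sizes for an (assumed) 192-pixel HR patch.
pretrain_sizes = [lr_size(192, s) for s in pretrain_scales]
finetune_sizes = [lr_size(192, s) for s in finetune_scales]
```

In this sketch the fine-tuning stage always yields the same low-resolution patch size, whereas the pre-training stage yields a different size at each iteration, which is the sense in which a discrete-scale model can be exposed to inter-scale variation before it is specialized.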
