C2D-ISR: Optimizing Attention-based Image Super-resolution from Continuous to Discrete Scales

Yuxuan Jiang, Chengxi Zeng, Siyue Teng, Fan Zhang, Xiaoqing Zhu, Joel Sole, David Bull

TL;DR

C2D-ISR addresses the trade-off between performance and complexity in attention-based single image super-resolution (SR) by learning inter-scale correlations through continuous-scale pre-training and then deploying an efficient discrete up-sampler. The approach introduces the Hierarchical Encoding Transformer (HiET), built on a modified U-Net architecture, and a two-stage training pipeline: (i) continuous-scale pre-training, which uses the implicit image function HIIF-L to model arbitrary scales, and (ii) discrete-scale fine-tuning with a conventional up-sampler for fixed scales. Applied to SwinIR-L, SRFormer-L, and MambaIRv2-L, C2D-ISR achieves PSNR gains of up to $0.2$ dB and reduces FLOPs by up to $11\%$ compared to HiT across multiple datasets and SR scales, while maintaining or improving SSIM. The framework enables faster inference and better multi-scale feature fusion, making it a practical step toward real-time, high-quality SR. The source code is to be released.
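To put the reported gain of up to $0.2$ dB in perspective, recall the standard definition $\mathrm{PSNR} = 10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$. The sketch below is an illustration only: the function names are hypothetical and not taken from the paper. It converts a PSNR gain into the equivalent reduction in mean squared error.

```python
import math


def psnr(mse: float, max_val: float = 1.0) -> float:
    """Standard PSNR in dB for a given mean squared error and peak value."""
    return 10.0 * math.log10(max_val ** 2 / mse)


def mse_ratio_for_gain(gain_db: float) -> float:
    """Ratio MSE_new / MSE_old implied by a PSNR gain of `gain_db` dB."""
    return 10.0 ** (-gain_db / 10.0)


ratio = mse_ratio_for_gain(0.2)
print(f"MSE ratio for a 0.2 dB gain: {ratio:.3f}")  # about 0.955
```

Under this definition, a 0.2 dB gain corresponds to roughly a 4.5% reduction in mean squared error at the same peak value.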

Abstract

In recent years, attention mechanisms have been exploited in single image super-resolution (SISR), achieving impressive reconstruction results. However, these advances are still limited by their reliance on simple training strategies and on network architectures designed for discrete up-sampling scales, which hinder the model's ability to capture information across multiple scales. To address these limitations, we propose a novel framework, C2D-ISR, for optimizing attention-based image super-resolution models from both performance and complexity perspectives. Our approach is based on a two-stage training methodology and a hierarchical encoding mechanism. The new training methodology applies continuous-scale training to discrete-scale models, enabling them to learn inter-scale correlations and multi-scale feature representations. In addition, we generalize the hierarchical encoding mechanism to existing attention-based network structures, which yields improved spatial feature fusion, cross-scale information aggregation and, more importantly, much faster inference. We have evaluated the C2D-ISR framework on three efficient attention-based backbones, SwinIR-L, SRFormer-L and MambaIRv2-L, and demonstrated significant improvements over the existing optimization framework, HiT, in terms of super-resolution performance (up to 0.2 dB) and computational complexity reduction (up to 11%). The source code will be made publicly available at www.github.com.
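The two-stage methodology can be pictured as a change in how the up-sampling scale is chosen per training iteration. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the uniform sampling, the scale range of 1 to 4 and the patch size are all hypothetical, since this summary does not state the actual settings.

```python
import random


def sample_scale(stage, rng, target_scale=4, scale_range=(1.0, 4.0)):
    """Pick the up-sampling scale for one training iteration.

    stage == "continuous": pre-training, scale drawn from a continuous range
                           (uniform sampling is an assumption of this sketch).
    stage == "discrete":   fine-tuning, scale fixed to the deployment scale.
    """
    if stage == "continuous":
        low, high = scale_range
        return rng.uniform(low, high)
    if stage == "discrete":
        return float(target_scale)
    raise ValueError(f"unknown stage: {stage!r}")


def lr_patch_size(hr_patch, scale):
    """Low-resolution patch size that pairs with an HR patch at this scale."""
    return max(1, round(hr_patch / scale))


rng = random.Random(0)  # fixed seed so the sketch is reproducible
pretrain_scales = [sample_scale("continuous", rng) for _ in range(5)]
finetune_scales = [sample_scale("discrete", rng) for _ in range(5)]

print([round(s, 2) for s in pretrain_scales])  # five values in [1, 4]
print(finetune_scales)                         # [4.0, 4.0, 4.0, 4.0, 4.0]
```

In this picture the deep feature extractor is shared across both stages, and only the up-sampler and the scale distribution change between pre-training and fine-tuning.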

Paper Structure

This paper contains 11 sections, 10 equations, 32 figures, 2 tables.

Figures (32)

  • Figure 1: Different ISR training strategies: (a) the conventional training methodology [liang2021swinir, guo2024mambair, dong2015image], which only involves training models at a fixed scale; (b) the training method based on multiple discrete scales, which can relatively improve single-scale ISR performance [kim2016accurate, lim2017enhanced, lai2017deep]; (c) the proposed C2D training strategy, which employs a new implicit image function to learn inter-scale correlations from continuous ISR models. This strategy maintains low computational complexity and enhances the overall performance. $u_{C}$ denotes the continuous up-sampler, while $u_{D}$ represents the discrete up-sampler. $s$ and $s'$ are the up-sampling scales, and $\sim$ stands for continuous scales.
  • Figure 2: The architecture of the proposed C2D-ISR framework. (Top) the overall architecture and the design of the HiET block. The hyperparameter $B$ stands for the number of HiET blocks in the deep feature extractor $f_{D}$. (Middle) the design of the HiET layer. (Bottom) the up-sampler used for continuous-scale and discrete-scale training.
  • Figure 4: Comparison of local attribution maps (LAM), generated with the Local Attribution Maps tool [gu2021interpreting].
  • Figure 5: Complexity-performance trade-off visualization for selected ISR methods. The results are based on the Urban100 dataset and $\times$4 task.
  • Figure: 78004 (BSD100, $\times$4)
  • ...and 27 more figures