HiTSR: A Hierarchical Transformer for Reference-based Super-Resolution

Masoomeh Aslahishahri, Jordan Ubbens, Ian Stavness

TL;DR

HiTSR tackles reference-based image super-resolution (Ref-SR) by learning joint representations across low-resolution (LR) input and high-resolution (HR) reference images using a hierarchical Swin Transformer with double attention. It integrates squeeze-and-excitation (SE) global context and long skip connections, enabling end-to-end training of a single network without multiple subnetworks or distillation. On datasets such as SUN80, Urban100, and Manga109, HiTSR achieves state-of-the-art performance under $L_1$-based objectives and competitive results with other loss configurations, while showing robustness to scale and rotation in reference images. The work demonstrates the effectiveness of cross-distribution attention for texture transfer and SR quality, suggesting a practical, scalable alternative to more complex Ref-SR pipelines.

Abstract

In this paper, we propose HiTSR, a hierarchical transformer model for reference-based image super-resolution, which enhances low-resolution input images by learning matching correspondences from high-resolution reference images. Diverging from existing multi-network, multi-stage approaches, we streamline the architecture and training pipeline by incorporating the double attention block from GAN literature. Processing two visual streams independently, we fuse self-attention and cross-attention blocks through a gating attention strategy. The model integrates a squeeze-and-excitation module to capture global context from the input images, facilitating long-range spatial interactions within window-based attention blocks. Long skip connections between shallow and deep layers further enhance information flow. Our model demonstrates superior performance across three datasets including SUN80, Urban100, and Manga109. Specifically, on the SUN80 dataset, our model achieves PSNR/SSIM values of 30.24/0.821. These results underscore the effectiveness of attention mechanisms in reference-based image super-resolution. The transformer-based model attains state-of-the-art results without the need for purpose-built subnetworks, knowledge distillation, or multi-stage training, emphasizing the potency of attention in meeting reference-based image super-resolution requirements.
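
The squeeze-and-excitation step mentioned in the abstract can be illustrated with a minimal sketch. It follows the generic SE recipe (global average pooling, a bottleneck of two linear layers, sigmoid channel scaling) and is not taken from the HiTSR code. The function name `squeeze_excite`, the weight layout, and the omission of biases are assumptions made for illustration.

```python
import math

def squeeze_excite(features, w_reduce, w_expand):
    """Generic SE block on a feature map given as C channels of H x W nested lists.

    w_reduce has shape (C/r) x C and w_expand has shape C x (C/r).
    Biases are omitted for brevity (an assumption of this sketch).
    """
    # Squeeze: global average pooling gives one scalar of global context per channel.
    squeezed = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in features]
    # Excitation, part 1: reduce to the bottleneck and apply ReLU.
    hidden = [max(0.0, sum(w * s for w, s in zip(w_row, squeezed))) for w_row in w_reduce]
    # Excitation, part 2: expand back to C channels and squash to (0, 1) with a sigmoid.
    scales = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(w_row, hidden))))
              for w_row in w_expand]
    # Rescale every spatial position of each channel by that channel's gate.
    rescaled = [[[scale * v for v in row] for row in ch]
                for ch, scale in zip(features, scales)]
    return rescaled, scales

# Toy input: 2 channels of size 2 x 2, with a bottleneck of width 1.
feats = [[[1.0, 1.0], [1.0, 1.0]], [[0.0, 2.0], [4.0, 2.0]]]
rescaled, scales = squeeze_excite(feats, [[1.0, 1.0]], [[1.0], [-1.0]])
print(scales)
```

Because each channel gate is computed from a spatial average over the whole feature map, the rescaling carries image-wide context that a window-restricted attention block would not otherwise see.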

Paper Structure (14 sections, 5 equations, 6 figures, 4 tables)

Figures (6)

  • Figure 1: Overview of HiTSR. The architecture takes two input images: LR features are extracted by a deep feature extraction (FE) module containing a 3-block SE layer, and HR reference features are extracted by a pre-trained VGG-based feature extractor. These features are further processed by the SE module, whose layer depth varies with the size of the input image. The "b1" and "b2" blocks, arranged sequentially, incorporate self- and cross-attention modules. In the self-attention module, the query and key matrices $Q_{LR}$ and $K_{LR}$ are derived from LR image features. For the cross-attention module, the query $Q_{ref}$ is derived from the HR reference image via the SE module. Gating attention is applied after the self- and cross-attention matrices are computed within each transformer block. An additional conv-based residual module, termed "PAR", acts as a post-attention residual block. Upsampling and downsampling layers are integrated to traverse different spatial resolutions. The combination of "b1" and "b2" is repeated three times. *: the downsampler module is disabled in the final "b2" block.
  • Figure 2: a) Hierarchical transformer blocks are connected via long skip connections (LSCs) between shallow and deep layers, using concatenation to link layers with matching dimensions. The last transformer block (the third "b2" block) does not include the downsampler module. b) Gating attention balances self- and cross-attention matrices within each transformer block, modulated by a gating parameter $\lambda$.
  • Figure 3: Qualitative comparisons on example images from the CUFED5 and Webly-Ref-SR datasets. Top row: target image, bicubic, ESRGAN, and ENet results. Bottom row: reference image, SRNTT, $C^2$-Matching, and our HiTSR results. All methods use $l_1$ loss, perceptual loss, and GAN loss objectives.
  • Figure 4: Qualitative comparisons between our results and $C^2$-Matching on the Urban100, SUN80, and Manga109 datasets. Each method uses $l_1$ loss, perceptual loss, and GAN loss objectives.
  • Figure 5: Qualitative comparisons from ablation studies: (a) with all components intact, and (j) after excluding the SE, PAR, and LSC modules, evaluated on the CUFED and Webly-Referenced SR datasets.
  • ...and 1 more figures
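
The gating described in the Figure 2(b) caption can be sketched as a blend of two attention maps controlled by $\lambda$. The captions do not give the exact formula, so this sketch makes several assumptions: the gate is `sigmoid(lam)`, the blend is a convex combination, the cross-attention map pairs $Q_{ref}$ with $K_{LR}$, and both maps are applied to one shared value matrix `v`. The helper names `attention_map` and `gated_attention` are hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_map(q, k):
    """Scaled dot-product attention weights, softmax(q k^T / sqrt(d)), row by row."""
    d = len(q[0])
    return [softmax([sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k])
            for qi in q]

def gated_attention(q_lr, k_lr, q_ref, v, lam):
    """Blend self- and cross-attention maps with a gate derived from lam.

    Assumed form (not confirmed by the source):
        gate = sigmoid(lam)
        mixed = (1 - gate) * self_map + gate * cross_map
    """
    gate = 1.0 / (1.0 + math.exp(-lam))
    self_map = attention_map(q_lr, k_lr)    # LR queries against LR keys
    cross_map = attention_map(q_ref, k_lr)  # reference queries against LR keys (assumed pairing)
    mixed = [[(1.0 - gate) * s + gate * c for s, c in zip(row_s, row_c)]
             for row_s, row_c in zip(self_map, cross_map)]
    # Apply the blended weights to the shared value matrix.
    output = [[sum(w * vj[c] for w, vj in zip(row, v)) for c in range(len(v[0]))]
              for row in mixed]
    return mixed, output

# Toy example: 2 tokens with 2-dimensional features.
q_lr = [[1.0, 0.0], [0.0, 1.0]]
k_lr = [[1.0, 0.0], [0.0, 1.0]]
q_ref = [[0.0, 1.0], [1.0, 0.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
mixed, output = gated_attention(q_lr, k_lr, q_ref, v, lam=0.0)
print(mixed)
print(output)
```

Since both inputs to the blend are row-stochastic and the gate lies in (0, 1), every row of the mixed map still sums to 1. A strongly negative `lam` recovers pure self-attention, and a strongly positive one recovers pure cross-attention.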