Training Transformer Models by Wavelet Losses Improves Quantitative and Visual Performance in Single Image Super-Resolution

Cansu Korkmaz, A. Murat Tekalp

TL;DR

This paper tackles single-image super-resolution with Transformer models by addressing two key gaps: limited global context in window-based attention and the inadequacy of RGB pixel losses for recovering high-frequency details. It introduces convolutional non-local sparse attention (NLSA) blocks to enlarge the receptive field around a HAT-based transformer, and a wavelet-based training loss using the Stationary Wavelet Transform (SWT) to guide high-frequency reconstruction. The combined approach, called Wavelettention, achieves state-of-the-art PSNR and enhanced visual quality on multiple benchmarks (notably Urban100, with gains up to ~0.72 dB over HAT) and also improves other Transformer SR models when trained with the SWT loss. This work demonstrates the practical benefit of wavelet-domain supervision for Transformer-based SR and provides a generic framework that can augment existing SR backbones with NLSA and SWT-based training.

Abstract

Transformer-based models have achieved remarkable results in low-level vision tasks including image super-resolution (SR). However, early Transformer-based approaches that rely on self-attention within non-overlapping windows encounter challenges in acquiring global information. To activate more input pixels globally, hybrid attention models have been proposed. Moreover, training by solely minimizing pixel-wise RGB losses, such as L1, has been found inadequate for capturing essential high-frequency details. This paper presents two contributions: i) We introduce convolutional non-local sparse attention (NLSA) blocks to extend the hybrid transformer architecture in order to further enhance its receptive field. ii) We employ wavelet losses to train Transformer models to improve quantitative and subjective performance. While wavelet losses have been explored previously, showing their power in training Transformer-based SR models is novel. Our experimental results demonstrate that the proposed model provides state-of-the-art PSNR results as well as superior visual performance across various benchmark datasets.

Paper Structure (19 sections, 2 equations, 5 figures, 5 tables)

Figures (5)

  • Figure 1: The proposed image SR architecture, which sandwiches HAT [chen2023activating] between NLSA blocks [mei_nlsa] for an enlarged receptive field.
  • Figure 2: Illustration of the Stationary Wavelet Transform (SWT). SWT uses low-pass and high-pass decomposition filter pairs to compute the wavelet coefficients without subsampling subbands.
  • Figure 3: Computation of pixel-wise $l_1$ loss on SWT subbands. Training hybrid transformer architectures by a weighted combination of RGB and SWT losses results in remarkable quantitative and qualitative performance improvements.
  • Figure 4: Visual comparison of SwinIR trained by $l_1$ loss [Liang2021SwinIRIR] vs. trained by SWT losses on images 53 and 92 from the Urban100 dataset [urban100_cite]. Observe that training SwinIR by $l_1$ loss results in hallucinated edge directions, whereas SwinIR trained by weighted $l_1$ and SWT losses (SwinIR+SWT) recovers all structures correctly.
  • Figure 5: Visual comparison of $\times4$ SR results on images selected from the Set14 [set14_cite], BSD100 [bsd100_cite] and Urban100 [urban100_cite] benchmarks. Observe that SwinIR trained by $l_1$ loss only (second column) generates aliasing artifacts, while CAT shows extreme blurring on some patches of the image in the last row. Other models show moderate blurring, while our models show the best visual results on all images.
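
The idea behind Figures 2 and 3 (an undecimated wavelet decomposition followed by a pixel-wise $l_1$ loss on each subband, combined with the usual pixel-domain $l_1$ term) can be sketched roughly as follows. This is only an illustrative NumPy sketch, not the authors' implementation. The Haar filter pair, the single decomposition level, the circular boundary handling, the function names `haar_swt2` and `swt_l1_loss`, and the placeholder weights are all assumptions made here for illustration. The paper's actual wavelet family, number of levels, colour channel and subband weights are not stated in this summary.

```python
import numpy as np


def haar_swt2(img):
    """One-level stationary (undecimated) Haar transform over axes 0 and 1.

    Every subband keeps the spatial size of `img` because no subsampling
    is applied. Circular boundary handling is an assumption of this sketch.
    """
    img = np.asarray(img, dtype=np.float64)

    def low(x, axis):
        # Haar low-pass filter: scaled sum of neighbouring samples.
        return (x + np.roll(x, -1, axis=axis)) / np.sqrt(2.0)

    def high(x, axis):
        # Haar high-pass filter: scaled difference of neighbouring samples.
        return (x - np.roll(x, -1, axis=axis)) / np.sqrt(2.0)

    row_lo, row_hi = low(img, 1), high(img, 1)
    # Subband naming conventions differ between libraries; here the first
    # letter refers to the filter applied along axis 1.
    return {
        "LL": low(row_lo, 0),
        "LH": high(row_lo, 0),
        "HL": low(row_hi, 0),
        "HH": high(row_hi, 0),
    }


def swt_l1_loss(sr, hr, pixel_weight=1.0, band_weights=None):
    """Weighted sum of a pixel-domain l1 term and per-subband l1 terms.

    `pixel_weight` and `band_weights` are hypothetical placeholders,
    not values taken from the paper.
    """
    if band_weights is None:
        band_weights = {"LL": 1.0, "LH": 1.0, "HL": 1.0, "HH": 1.0}

    sr = np.asarray(sr, dtype=np.float64)
    hr = np.asarray(hr, dtype=np.float64)

    loss = pixel_weight * np.mean(np.abs(sr - hr))
    sr_bands, hr_bands = haar_swt2(sr), haar_swt2(hr)
    for name, weight in band_weights.items():
        loss += weight * np.mean(np.abs(sr_bands[name] - hr_bands[name]))
    return float(loss)
```

Because the subbands are not subsampled, each one aligns pixel-for-pixel with the image, so raising a high-frequency weight (for example `band_weights["HH"]`) penalises errors in fine detail at the locations where they occur. In actual training the same computation would have to be written with differentiable operations in the training framework rather than NumPy.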