Training Transformer Models by Wavelet Losses Improves Quantitative and Visual Performance in Single Image Super-Resolution
Cansu Korkmaz, A. Murat Tekalp
TL;DR
This paper tackles single-image super-resolution with Transformer models by addressing two key gaps: the limited global context of window-based attention and the inadequacy of RGB pixel losses for recovering high-frequency details. It adds convolutional non-local sparse attention (NLSA) blocks to a HAT-based Transformer to enlarge its receptive field, and it trains with a wavelet-domain loss based on the Stationary Wavelet Transform (SWT) to guide high-frequency reconstruction. The combined approach, called Wavelettention, achieves state-of-the-art PSNR and enhanced visual quality on multiple benchmarks (notably Urban100, with gains up to ~0.72 dB over HAT), and the SWT loss also improves other Transformer SR models trained with it. The work demonstrates the practical benefit of wavelet-domain supervision for Transformer-based SR and provides a generic framework for augmenting existing SR backbones with NLSA blocks and SWT-based training.
Abstract
Transformer-based models have achieved remarkable results in low-level vision tasks, including image super-resolution (SR). However, early Transformer-based approaches that rely on self-attention within non-overlapping windows have difficulty capturing global information. To activate more input pixels globally, hybrid attention models have been proposed. Moreover, training by solely minimizing pixel-wise RGB losses, such as L1, has been found inadequate for capturing essential high-frequency details. This paper presents two contributions: i) We introduce convolutional non-local sparse attention (NLSA) blocks to extend the hybrid transformer architecture in order to further enhance its receptive field. ii) We employ wavelet losses to train Transformer models to improve quantitative and subjective performance. While wavelet losses have been explored previously, showing their power in training Transformer-based SR models is novel. Our experimental results demonstrate that the proposed model provides state-of-the-art PSNR results as well as superior visual performance across various benchmark datasets.
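To make the idea of a wavelet-domain loss concrete, the following is a minimal NumPy sketch and not the authors' implementation. It assumes a one-level undecimated (stationary) Haar transform with periodic boundaries, chosen here only for simplicity; the wavelet family, number of levels, and subband weights actually used in the paper are not given in this section. The function names `haar_swt2` and `swt_l1_loss`, the subband labels, and the uniform default weights are all hypothetical.

```python
import numpy as np


def haar_swt2(img):
    """One-level undecimated Haar transform of a 2-D array.

    Assumption: periodic boundary handling via np.roll. No downsampling is
    applied, so every subband keeps the input's shape.
    """
    def low(x, axis):
        # Averaging (low-pass) filter along the given axis.
        return (x + np.roll(x, -1, axis=axis)) / 2.0

    def high(x, axis):
        # Differencing (high-pass) filter along the given axis.
        return (x - np.roll(x, -1, axis=axis)) / 2.0

    lo_w = low(img, 1)   # low-pass along width
    hi_w = high(img, 1)  # high-pass along width
    # Subband labels are a naming convention chosen for this sketch.
    return {
        "LL": low(lo_w, 0),
        "LH": high(lo_w, 0),
        "HL": low(hi_w, 0),
        "HH": high(hi_w, 0),
    }


def swt_l1_loss(sr, hr, weights=None):
    """Weighted sum of mean absolute errors between matching subbands."""
    if weights is None:
        # Hypothetical uniform weights; a real setup would tune these.
        weights = {"LL": 1.0, "LH": 1.0, "HL": 1.0, "HH": 1.0}
    sr_bands = haar_swt2(sr)
    hr_bands = haar_swt2(hr)
    return sum(
        w * float(np.mean(np.abs(sr_bands[k] - hr_bands[k])))
        for k, w in weights.items()
    )


# Usage: compare a noisy "super-resolved" image with its ground truth.
rng = np.random.default_rng(0)
hr = rng.random((8, 8))
sr = hr + 0.1 * rng.standard_normal((8, 8))
loss = swt_l1_loss(sr, hr)
```

Because the transform is undecimated, each subband is aligned pixel-for-pixel with the image, so errors in the detail subbands (LH, HL, HH) can be penalized separately from errors in the smooth LL band. Raising the detail weights relative to LL is one way such a loss could emphasize high-frequency reconstruction.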
