Diffusion Transformer meets Multi-level Wavelet Spectrum for Single Image Super-Resolution
Peng Du, Hui Li, Han Xu, Paul Barom Jeon, Dongwook Lee, Daehyun Ji, Ran Yang, Feng Zhu
TL;DR
This work addresses single-image super-resolution by explicitly modeling inter-scale relationships among multi-level wavelet sub-bands. It introduces a conditional diffusion-transformer framework, DTWSR, that operates in the wavelet-spectrum domain and uses pyramid tokenization plus a dual-decoder network to denoise low- and high-frequency components while aligning their sub-bands. Key contributions include the Wavelet Spectrum Denoising Network (WSDT) with LEDec and HDDec decoders and a pyramid-token-based embedding, enabling efficient learning of cross-band dependencies. Extensive experiments on face and general SR benchmarks demonstrate state-of-the-art performance in both objective fidelity and perceptual quality, highlighting the practical impact for high-fidelity image restoration.
Abstract
Discrete Wavelet Transform (DWT) has been widely explored to enhance the performance of image super-resolution (SR). Although some DWT-based methods improve SR by capturing fine-grained frequency signals, most existing approaches neglect the interrelations among multiscale frequency sub-bands, resulting in inconsistencies and unnatural artifacts in the reconstructed images. To address this challenge, we propose a Diffusion Transformer model based on image Wavelet spectra for SR (DTWSR). DTWSR combines the strengths of diffusion models and transformers to capture the interrelations among multiscale frequency sub-bands, leading to more consistent and realistic SR images. Specifically, we use a Multi-level Discrete Wavelet Transform to decompose images into wavelet spectra. We propose a pyramid tokenization method that embeds the spectra into a sequence of tokens for the transformer model, making it easier to capture features from both the spatial and frequency domains. A dual-decoder is carefully designed to handle the distinct variances of the low-frequency and high-frequency sub-bands while preserving their alignment during image generation. Extensive experiments on multiple benchmark datasets demonstrate the effectiveness of our method, with high performance in both perceptual quality and fidelity.
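To make the decomposition and tokenization steps concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes a Haar wavelet (the abstract does not name the wavelet family), and the function `pyramid_tokens` is a hypothetical illustration of one way a pyramid tokenization could work: the patch size doubles at each finer level so that every level contributes the same number of spatially aligned tokens. The names `haar_dwt2`, `multilevel_dwt2`, `patchify`, and `base_patch` are likewise illustrative assumptions.

```python
def haar_dwt2(img):
    """One level of an orthonormal 2-D Haar DWT (assumed wavelet).

    img is a list of rows with even height and width.
    Returns four half-resolution sub-bands (LL, LH, HL, HH).
    """
    h, w = len(img) // 2, len(img[0]) // 2
    ll = [[0.0] * w for _ in range(h)]
    lh = [[0.0] * w for _ in range(h)]
    hl = [[0.0] * w for _ in range(h)]
    hh = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            a, b = img[2 * i][2 * j], img[2 * i][2 * j + 1]
            c, d = img[2 * i + 1][2 * j], img[2 * i + 1][2 * j + 1]
            ll[i][j] = (a + b + c + d) / 2  # local average (low frequency)
            lh[i][j] = (a + b - c - d) / 2  # difference between rows
            hl[i][j] = (a - b + c - d) / 2  # difference between columns
            hh[i][j] = (a - b - c + d) / 2  # diagonal difference
    return ll, lh, hl, hh


def multilevel_dwt2(img, levels):
    """Repeatedly decompose the LL band.

    Returns the coarsest LL band and a list of (LH, HL, HH) tuples,
    ordered from the finest level to the coarsest.
    """
    details = []
    ll = img
    for _ in range(levels):
        ll, lh, hl, hh = haar_dwt2(ll)
        details.append((lh, hl, hh))
    return ll, details


def patchify(band, p):
    """Split a band into non-overlapping p x p patches, each flattened."""
    return [
        [band[i + di][j + dj] for di in range(p) for dj in range(p)]
        for i in range(0, len(band), p)
        for j in range(0, len(band[0]), p)
    ]


def pyramid_tokens(ll, details, base_patch=1):
    """Hypothetical pyramid tokenization (an assumption, not the paper's spec).

    The patch size doubles at each finer level, so every level yields the
    same token grid and token k covers the same image region at all levels.
    """
    levels = len(details)
    tokens = {"LL": patchify(ll, base_patch)}
    for k, (lh, hl, hh) in enumerate(details):  # k = 0 is the finest level
        p = base_patch * 2 ** (levels - 1 - k)
        tokens[f"level{k + 1}"] = [
            t_lh + t_hl + t_hh  # concatenate the three detail bands per token
            for t_lh, t_hl, t_hh in zip(
                patchify(lh, p), patchify(hl, p), patchify(hh, p)
            )
        ]
    return tokens


# Usage: a two-level decomposition of an 8 x 8 ramp image.
image = [[float(8 * r + c) for c in range(8)] for r in range(8)]
ll, details = multilevel_dwt2(image, levels=2)
tokens = pyramid_tokens(ll, details)
```

In this sketch the token count is identical across levels while the token length grows with resolution, which is one simple way to let a transformer attend across sub-bands that describe the same image region. A real model would additionally project each token to a common embedding width.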
