One-Step Diffusion Transformer for Controllable Real-World Image Super-Resolution
Yushun Fang, Yuxiang Chen, Shibo Yin, Qiang Hu, Jiangchao Yao, Ya Zhang, Xiaoyun Zhang, Yanfeng Wang
TL;DR
ODTSR is a one-step diffusion transformer for real-world image super-resolution (Real-ISR) that targets fidelity and controllability jointly through two components: a Noise-hybrid Visual Stream (NVS) and Fidelity-aware Adversarial Training (FAA). NVS feeds the low-quality input through two streams, one with consistent Prior Noise that preserves the diffusion prior and one with adjustable Control Noise that modulates fidelity, so that restoration can be guided by prompts without training on task-specific datasets. The approach reports state-of-the-art performance on real-world datasets and generalizes to challenging sub-domains such as Chinese scene text, while supporting bilingual prompts. Ablations and user studies support the contribution of NVS and FAA to balancing fidelity with controllable generation in an efficient one-step model.
Abstract
Recent advances in diffusion-based real-world image super-resolution (Real-ISR) have demonstrated remarkable perceptual quality, yet the balance between fidelity and controllability remains unresolved: multi-step diffusion-based methods suffer from generative diversity and randomness, resulting in low fidelity, while one-step methods lose control flexibility due to fidelity-specific finetuning. In this paper, we present ODTSR, a one-step diffusion transformer based on Qwen-Image that addresses fidelity and controllability simultaneously in Real-ISR: a newly introduced visual stream receives low-quality (LQ) images with adjustable noise (Control Noise), and the original visual stream receives LQ images with consistent noise (Prior Noise), forming the Noise-hybrid Visual Stream (NVS) design. ODTSR further employs Fidelity-aware Adversarial Training (FAA) to enhance controllability and achieve one-step inference. Extensive experiments demonstrate that ODTSR not only achieves state-of-the-art (SOTA) performance on generic Real-ISR, but also enables prompt controllability in challenging scenarios such as real-world scene text image super-resolution (STISR) of Chinese characters without training on specific datasets. Code is available at https://github.com/RedMediaTech/ODTSR.
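
The two-stream input described above can be illustrated with a toy sketch. This is not the authors' implementation: the linear blend between latent and Gaussian noise, the names add_noise and build_nvs_inputs, and the fixed level PRIOR_NOISE_LEVEL = 0.5 are all assumptions made for illustration. It only shows the idea that one stream sees the LQ latent at a consistent noise level while the other sees it at a user-adjustable level.

```python
import random

# Hypothetical fixed noise level for the Prior Noise stream (not from the paper).
PRIOR_NOISE_LEVEL = 0.5


def add_noise(latent, level, seed):
    """Blend a latent with Gaussian noise: (1 - level) * x + level * eps.

    The linear blend is an assumed noising rule chosen for simplicity.
    """
    if not 0.0 <= level <= 1.0:
        raise ValueError("level must be in [0, 1]")
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    return [(1.0 - level) * x + level * rng.gauss(0.0, 1.0) for x in latent]


def build_nvs_inputs(lq_latent, control_level, seed=0):
    """Return (prior_stream, control_stream) inputs for a toy LQ latent.

    prior_stream always uses the same fixed noise level; control_stream uses
    the adjustable control_level supplied by the caller.
    """
    prior_stream = add_noise(lq_latent, PRIOR_NOISE_LEVEL, seed)
    control_stream = add_noise(lq_latent, control_level, seed + 1)
    return prior_stream, control_stream


if __name__ == "__main__":
    lq = [0.1, -0.2, 0.3, 0.4]  # stand-in for a flattened LQ latent
    prior, control = build_nvs_inputs(lq, control_level=0.2)
    print("prior stream:  ", prior)
    print("control stream:", control)
```

In this sketch, a control level of 0 passes the LQ latent through unchanged, while larger values replace more of it with noise; the prior stream is unaffected by that setting. How the actual model consumes and combines the two streams is not specified in the abstract.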
