One-Step Diffusion Transformer for Controllable Real-World Image Super-Resolution

Yushun Fang, Yuxiang Chen, Shibo Yin, Qiang Hu, Jiangchao Yao, Ya Zhang, Xiaoyun Zhang, Yanfeng Wang

TL;DR

ODTSR is a one-step diffusion transformer for real-world image super-resolution (Real-ISR) that targets fidelity and controllability together through two components: a Noise-hybrid Visual Stream (NVS) and Fidelity-aware Adversarial Training (FAA). NVS splits the noise input into a Prior Noise stream, which carries the diffusion prior, and a Control Noise stream, whose adjustable noise modulates fidelity, so prompts can guide restoration without extensive fine-tuning. The approach reports state-of-the-art performance on real-world datasets, generalizes to challenging sub-domains such as Chinese text, and supports bilingual prompts. Ablations and user studies support the contribution of NVS and FAA to balancing fidelity against controllable generation.

Abstract

Recent advances in diffusion-based real-world image super-resolution (Real-ISR) have demonstrated remarkable perceptual quality, yet balancing fidelity and controllability remains a problem: multi-step diffusion-based methods exhibit generative diversity and randomness that lower fidelity, while one-step methods lose control flexibility because of fidelity-specific fine-tuning. In this paper, we present ODTSR, a one-step diffusion transformer based on Qwen-Image that performs Real-ISR with fidelity and controllability considered simultaneously: a newly introduced visual stream receives low-quality (LQ) images with adjustable noise (Control Noise), and the original visual stream receives LQ images with consistent noise (Prior Noise), forming the Noise-hybrid Visual Stream (NVS) design. ODTSR further employs Fidelity-aware Adversarial Training (FAA) to enhance controllability and achieve one-step inference. Extensive experiments demonstrate that ODTSR not only achieves state-of-the-art (SOTA) performance on generic Real-ISR, but also enables prompt controllability in challenging scenarios such as real-world scene text image super-resolution (STISR) of Chinese characters without training on specific datasets. Code is available at https://github.com/RedMediaTech/ODTSR.
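
The two-stream input described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the linear interpolation between latent and Gaussian noise is an assumed rectified-flow-style noising rule, and the names `add_noise`, `noise_hybrid_inputs`, `prior_t`, and `control_t`, as well as the example noise levels, are hypothetical.

```python
import random


def add_noise(latent, t, rng):
    """Blend a latent with Gaussian noise at level t in [0, 1].

    Assumption: linear interpolation (1 - t) * x + t * noise, a common
    rectified-flow form. The paper's exact noising rule is not given here.
    """
    return [(1.0 - t) * x + t * rng.gauss(0.0, 1.0) for x in latent]


def noise_hybrid_inputs(lq_latent, prior_t, control_t, seed=0):
    """Build the inputs for the two visual streams from one LQ latent.

    prior_t   : noise level kept consistent for the Prior Noise stream
    control_t : adjustable noise level for the Control Noise stream
    Both parameter names are hypothetical.
    """
    rng = random.Random(seed)
    prior_input = add_noise(lq_latent, prior_t, rng)
    control_input = add_noise(lq_latent, control_t, rng)
    return prior_input, control_input


if __name__ == "__main__":
    lq = [0.5, -1.0, 2.0]  # toy stand-in for an encoded LQ image
    prior, control = noise_hybrid_inputs(lq, prior_t=0.8, control_t=0.2)
    print("prior stream input:  ", prior)
    print("control stream input:", control)
```

Under this assumed rule, setting `control_t` to 0 passes the LQ latent through unchanged, while raising it leaves more of the Control Noise stream input to be decided by the model.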

Paper Structure

This paper contains 51 sections, 16 equations, 10 figures, 6 tables.

Figures (10)

  • Figure 1: Controllable Real-ISR: Qualitative results of our ODTSR and other state-of-the-art methods. Our method achieves superior quality, supports flexible bilingual prompt controllability, and covers challenging sub-domains such as Chinese text images, fine-grained texture and face images. "f" denotes controllable Fidelity Weight in ODTSR. More results are shown in the supplementary materials.
  • Figure 2: Effects of Noise on Fidelity and Controllability. Based on pretrained Qwen-Image, (a) shows the denoising results of the ground truth (GT) under the same prompt and different noise levels. (b) uses a low noise level and different prompts (with Chinese annotation). (c) studies the effects of noise on LQ inputs with the same prompt. High-noise inputs improve perceptual quality and controllability but reduce fidelity, whereas low-noise inputs preserve original details yet fail to deliver enhanced super-resolution effects.
  • Figure 3: The scheduler used in Qwen-Image. The horizontal axis is the discrete timestep (1000 values, from 0 to 999), and the vertical axis is the corresponding value of $t$. During pre-training, the timestep is sampled uniformly to obtain $t$.
  • Figure 4: User Study Interface.
  • Figure 5: The model contains 60 transformer layers in total. The details of a single transformer layer are shown here: the left branch corresponds to the Text stream, the middle to the Prior Noise stream, and the right to the Control Noise stream. Among them, only the linear layers in the Control Noise stream are trained with LoRA, while all other parameters remain frozen.
  • ...and 5 more figures
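
The training setup in the Figure 5 caption (LoRA on the linear layers of the Control Noise stream only, everything else frozen) can be sketched as follows. This is a toy illustration under stated assumptions, not the released code: the standard LoRA update $W + \frac{\alpha}{r} BA$ with zero-initialized $B$ is assumed, and the parameter naming scheme, the single `linear` module per stream, and the `rank` and `alpha` values are all hypothetical.

```python
import random

NUM_LAYERS = 60  # layer count stated in the Figure 5 caption
STREAMS = ("text", "prior_noise", "control_noise")  # hypothetical identifiers


def matvec(matrix, vector):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / rank) * B @ A."""

    def __init__(self, in_dim, out_dim, rank=2, alpha=2.0, seed=0):
        rng = random.Random(seed)
        # Frozen pretrained weight (random stand-in values here).
        self.weight = [[rng.gauss(0.0, 0.02) for _ in range(in_dim)]
                       for _ in range(out_dim)]
        # Trainable down-projection A, small random initialization.
        self.lora_a = [[rng.gauss(0.0, 0.02) for _ in range(in_dim)]
                       for _ in range(rank)]
        # Trainable up-projection B, zero initialization, so the layer
        # initially reproduces the frozen layer's output.
        self.lora_b = [[0.0] * rank for _ in range(out_dim)]
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.weight, x)
        update = matvec(self.lora_b, matvec(self.lora_a, x))
        return [b + self.scale * u for b, u in zip(base, update)]


def trainable_parameter_names(num_layers=NUM_LAYERS):
    """List the parameters that would receive gradients.

    Only LoRA factors in the Control Noise stream are selected; base weights
    and both other streams stay frozen. Names are hypothetical.
    """
    names = []
    for layer in range(num_layers):
        for stream in STREAMS:
            for param in ("weight", "lora_a", "lora_b"):
                if stream == "control_noise" and param.startswith("lora_"):
                    names.append(f"layers.{layer}.{stream}.linear.{param}")
    return names


if __name__ == "__main__":
    layer = LoRALinear(in_dim=3, out_dim=2)
    print(layer.forward([1.0, 2.0, 3.0]))
    print(len(trainable_parameter_names()), "trainable tensors in this toy layout")
```

Zero-initializing `lora_b` means training starts from the pretrained model's behavior, which is the usual reason LoRA is paired with frozen base weights.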