OMGSR: You Only Need One Mid-timestep Guidance for Real-World Image Super-Resolution
Zhiqiang Wu, Zhaomang Sun, Tong Zhou, Bingtao Fu, Ji Cong, Yitong Dong, Huaqi Zhang, Xuan Tang, Mingsong Chen, Xian Wei
TL;DR
This work tackles Real-World Image Super-Resolution (Real-ISR) with diffusion priors, starting from the observation that the low-quality (LQ) image latent is closer to the pre-trained noisy latents at a mid-timestep than at the start or end of the scheduler. It introduces an SNR-based procedure that pre-computes the average optimal mid-timestep t* for injection, and a Latent Representation Refinement (LRR) loss, applied through a LoRA-enhanced VAE encoder, that narrows the remaining latent gap. The DDPM backbone is also fine-tuned with LoRA to perform one-step denoising at t*. The resulting OMGSR framework is GAN-based: it pairs a DDPM-based generator with a DINOv3-ConvNeXt discriminator and adds a new DINOv3-ConvNeXt DISTS (Dv3CD) loss for structural perception at varying resolutions. It reports state-of-the-art results on four Real-ISR datasets, both qualitatively and quantitatively. The framework is also compatible with Flow Matching-based generators (OMGSR-F), and its SD2.1-base variant (OMGSR-S) preserves detail while running one-step inference. Code and supplementary materials are released to support reproducibility and application to other diffusion backbones and resolutions.
Abstract
Denoising Diffusion Probabilistic Models (DDPMs) show promising potential in one-step Real-World Image Super-Resolution (Real-ISR). Current one-step Real-ISR methods typically inject the low-quality (LQ) image latent representation at the start or end timestep of the DDPM scheduler. Recent studies have begun to note that the LQ image latent and the pre-trained noisy latent representations are intuitively closer at a mid-timestep. However, a quantitative analysis of these latent representations remains lacking. Considering these latent representations can be decomposed into signal and noise, we propose a method based on the Signal-to-Noise Ratio (SNR) to pre-compute an average optimal mid-timestep for injection. To better approximate the pre-trained noisy latent representation, we further introduce the Latent Representation Refinement (LRR) loss via a LoRA-enhanced VAE encoder. We also fine-tune the backbone of the DDPM-based generative model using LoRA to perform one-step denoising at the average optimal mid-timestep. Based on these components, we present OMGSR, a GAN-based Real-ISR framework that employs a DDPM-based generative model as the generator and a DINOv3-ConvNeXt model with multi-level discriminator heads as the discriminator. We also propose the DINOv3-ConvNeXt DISTS (Dv3CD) loss, which is enhanced for structural perception at varying resolutions. Within the OMGSR framework, we develop OMGSR-S based on SD2.1-base. An ablation study confirms that our pre-computation strategy and LRR loss significantly improve the baseline. Comparative studies demonstrate that OMGSR-S achieves state-of-the-art performance across multiple metrics. Code is available at https://github.com/wuer5/OMGSR.
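The abstract does not spell out how the SNR-based pre-computation works, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes the standard DDPM identity SNR(t) = ᾱ_t / (1 − ᾱ_t), treats the high-quality latent as "signal" and the LQ-minus-HQ residual as "noise", and picks the timestep whose scheduler SNR is closest in log space. The function names, the scaled-linear schedule constants, and the use of flat Python lists as latents are all illustrative assumptions.

```python
import math

def scaled_linear_alphas_cumprod(num_steps=1000, beta_start=0.00085, beta_end=0.012):
    """Cumulative alpha products for a scaled-linear beta schedule.

    The schedule type and constants are assumptions for illustration.
    """
    lo, hi = math.sqrt(beta_start), math.sqrt(beta_end)
    alphas_cumprod, prod = [], 1.0
    for i in range(num_steps):
        # Interpolate linearly in sqrt(beta) space, then square.
        beta = (lo + (hi - lo) * i / (num_steps - 1)) ** 2
        prod *= 1.0 - beta
        alphas_cumprod.append(prod)
    return alphas_cumprod

def scheduler_snr(alpha_bar):
    """SNR of a DDPM noisy latent sqrt(a)*z0 + sqrt(1-a)*eps, for a = alpha_bar."""
    return alpha_bar / (1.0 - alpha_bar)

def latent_snr(hq_latent, lq_latent):
    """Hypothetical SNR of an LQ latent: HQ energy over residual energy."""
    signal = sum(h * h for h in hq_latent)
    noise = sum((l - h) ** 2 for h, l in zip(hq_latent, lq_latent))
    return signal / max(noise, 1e-12)  # guard against identical latents

def closest_timestep(target_snr, alphas_cumprod):
    """Index t whose scheduler SNR is nearest to target_snr in log space."""
    log_target = math.log(target_snr)
    return min(
        range(len(alphas_cumprod)),
        key=lambda t: abs(math.log(scheduler_snr(alphas_cumprod[t])) - log_target),
    )

def average_optimal_timestep(latent_pairs, alphas_cumprod):
    """Average the per-pair closest timesteps over (hq, lq) latent pairs."""
    steps = [closest_timestep(latent_snr(hq, lq), alphas_cumprod)
             for hq, lq in latent_pairs]
    return round(sum(steps) / len(steps))
```

In practice the latent pairs would come from encoding paired HQ and LQ training images with the VAE encoder and flattening the results; because the average is computed once over a dataset, the resulting timestep can be fixed before training, which matches the "pre-compute" wording of the abstract.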
