OMGSR: You Only Need One Mid-timestep Guidance for Real-World Image Super-Resolution

Zhiqiang Wu, Zhaomang Sun, Tong Zhou, Bingtao Fu, Ji Cong, Yitong Dong, Huaqi Zhang, Xuan Tang, Mingsong Chen, Xian Wei

TL;DR

This work tackles Real-World Image Super-Resolution with diffusion priors by identifying that injecting the low-quality latent at mid-timesteps yields closer alignment to pre-trained noisy latents. It introduces a quantitative SNR-based pre-computation of the optimal mid-timestep t*, and a Latent Representation Refinement (LRR) loss via a LoRA-enhanced VAE encoder to reduce latent gaps, paired with fine-tuning of the DDPM backbone. The OMGSR framework combines a DDPM-based generator with a DINOv3-ConvNeXt discriminator and a new Dv3CD loss to improve structural fidelity across high resolutions, achieving state-of-the-art results on four Real-ISR datasets with strong qualitative and quantitative gains. It also demonstrates compatibility with Flow Matching-based generators (OMGSR-F) and provides an efficient one-step variant (OMGSR-S) that preserves detail while enabling fast inference. The accompanying code and supplementary materials support reproducibility and broader applicability to various diffusion backbones and resolutions.

Abstract

Denoising Diffusion Probabilistic Models (DDPMs) show promising potential in one-step Real-World Image Super-Resolution (Real-ISR). Current one-step Real-ISR methods typically inject the low-quality (LQ) image latent representation at the start or end timestep of the DDPM scheduler. Recent studies have begun to note that the LQ image latent and the pre-trained noisy latent representations are intuitively closer at a mid-timestep. However, a quantitative analysis of these latent representations remains lacking. Considering these latent representations can be decomposed into signal and noise, we propose a method based on the Signal-to-Noise Ratio (SNR) to pre-compute an average optimal mid-timestep for injection. To better approximate the pre-trained noisy latent representation, we further introduce the Latent Representation Refinement (LRR) loss via a LoRA-enhanced VAE encoder. We also fine-tune the backbone of the DDPM-based generative model using LoRA to perform one-step denoising at the average optimal mid-timestep. Based on these components, we present OMGSR, a GAN-based Real-ISR framework that employs a DDPM-based generative model as the generator and a DINOv3-ConvNeXt model with multi-level discriminator heads as the discriminator. We also propose the DINOv3-ConvNeXt DISTS (Dv3CD) loss, which is enhanced for structural perception at varying resolutions. Within the OMGSR framework, we develop OMGSR-S based on SD2.1-base. An ablation study confirms that our pre-computation strategy and LRR loss significantly improve the baseline. Comparative studies demonstrate that OMGSR-S achieves state-of-the-art performance across multiple metrics. Code is available at https://github.com/wuer5/OMGSR.
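
The SNR-based pre-computation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and every function name below is hypothetical. It assumes the scaled-linear beta schedule commonly used with Stable Diffusion (betas from 0.00085 to 0.012 over 1000 steps), for which SNR(t) = alpha_bar_t / (1 - alpha_bar_t). It further assumes, purely for illustration, that the SNR of an LQ latent can be estimated by treating the HQ latent as signal and the LQ-minus-HQ residual as noise. The paper's actual estimator may differ.

```python
import numpy as np

def ddpm_snr(num_steps=1000, beta_start=0.00085, beta_end=0.012):
    """SNR(t) = alpha_bar_t / (1 - alpha_bar_t) for a scaled-linear schedule (assumed)."""
    betas = np.linspace(beta_start ** 0.5, beta_end ** 0.5, num_steps) ** 2
    alpha_bar = np.cumprod(1.0 - betas)
    return alpha_bar / (1.0 - alpha_bar)

def latent_snr(z_hq, z_lq):
    """Illustrative assumption: HQ latent is the signal, (LQ - HQ) is the noise."""
    signal_power = np.mean(z_hq ** 2)
    noise_power = np.mean((z_lq - z_hq) ** 2)
    return signal_power / noise_power

def nearest_timestep(target_snr, snr_table):
    """Index of the timestep whose scheduler SNR is closest to the target (log scale)."""
    return int(np.argmin(np.abs(np.log(snr_table) - np.log(target_snr))))

def average_mid_timestep(latent_pairs, snr_table):
    """Average the matched timestep over a set of (z_hq, z_lq) latent pairs."""
    steps = [nearest_timestep(latent_snr(h, l), snr_table) for h, l in latent_pairs]
    return int(round(float(np.mean(steps))))

if __name__ == "__main__":
    # Synthetic stand-ins for VAE latents; real usage would encode HQ/LQ image pairs.
    rng = np.random.default_rng(0)
    snr_table = ddpm_snr()
    pairs = []
    for _ in range(8):
        z_hq = rng.standard_normal((4, 64, 64))
        z_lq = z_hq + 0.8 * rng.standard_normal((4, 64, 64))
        pairs.append((z_hq, z_lq))
    print("average matched timestep:", average_mid_timestep(pairs, snr_table))
```

Matching in log-SNR rather than raw SNR is a design choice of this sketch: scheduler SNR spans several orders of magnitude, so a log-scale distance avoids the match being dominated by the low-noise end of the schedule.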

Paper Structure

This paper contains 41 sections, 23 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Illustration of a 20-step inference process with the prompt "a cute cat" in the SD2.1-base DDPM. We decode the latent representation into an image at each step. The LQ image is sampled from the RealESRGAN degradation pipeline.
  • Figure 2: Illustration of our OMGSR-S training and inference pipelines. OMGSR-S achieves the fastest inference speed among SD-based Real-ISR models, as the input passes through the VAE encoder, the one-step prediction, and the VAE decoder only once.
  • Figure 3: (a) DINOv3-ConvNeXt is used to extract multi-level features, which are fed into (b) the multi-level discriminator to obtain the logits for discrimination. Note that $\operatorname{BlurPool}$ is a low-pass filter used for anti-aliasing to prevent artifacts.
  • Figure 4: Ablation study of OMGSR-S with different timesteps. We take timestep $273$ as the baseline and vary it in intervals of $100$. Note that we conduct these experiments on OMGSR-S without the proposed LRR loss to avoid any impact on the latent representation. We report a normalized score based on $9$ metrics across four datasets. The details are in the supplementary materials.
  • Figure 5: An artifact case in OMGSR-S with $\mathcal{L}_{\operatorname{LPIPS}}$ vs. $\mathcal{L}_{\operatorname{Dv3CD}}$.
  • ...and 3 more figures