You Only Need One Step: Fast Super-Resolution with Stable Diffusion via Scale Distillation

Mehdi Noroozi, Isma Hadji, Brais Martinez, Adrian Bulat, Georgios Tzimiropoulos

TL;DR

The paper addresses the computational bottleneck of diffusion-based super-resolution by introducing scale distillation, which progressively trains teacher and student models across increasing magnifications to provide noise-adaptive supervision. When combined with decoder fine-tuning on a frozen one-step diffusion backbone, YONOS-SR achieves state-of-the-art SR quality with a single inference step, significantly reducing inference cost. The three core contributions are (i) scale distillation for accurate, scale-aware supervision across denoising steps, (ii) demonstrating that 1-step diffusion solutions can be viable for high-fidelity SR when paired with targeted decoder fine-tuning, and (iii) extensive experiments showing superior performance over 200-step diffusion SR baselines on both synthetic and real degradation pipelines. This approach offers practical speedups for real-image SR and has potential applicability to other inverse-imaging tasks such as inpainting or deblurring.

Abstract

In this paper, we introduce YONOS-SR, a novel stable diffusion-based approach for image super-resolution that yields state-of-the-art results using only a single DDIM step. We propose a novel scale distillation approach to train our SR model. Instead of directly training our SR model on the scale factor of interest, we start by training a teacher model on a smaller magnification scale, thereby making the SR problem simpler for the teacher. We then train a student model for a higher magnification scale, using the predictions of the teacher as a target during the training. This process is repeated iteratively until we reach the target scale factor of the final model. The rationale behind our scale distillation is that the teacher aids the student diffusion model training by i) providing a target adapted to the current noise level rather than using the same target coming from ground truth data for all noise levels and ii) providing an accurate target as the teacher has a simpler task to solve. We empirically show that the distilled model significantly outperforms the model trained for high scales directly, particularly when few steps are used during inference. Having a strong diffusion model that requires only one step allows us to freeze the U-Net and fine-tune the decoder on top of it. We show that the combination of spatially distilled U-Net and fine-tuned decoder outperforms state-of-the-art methods requiring 200 steps while using only a single step.
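
The iterative teacher-to-student procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names (`distillation_schedule`, `student_loss`), the toy list-based "models", the mean-squared-error objective, and the assumption that the magnification doubles at every stage (as in the $\times 2 \rightarrow \times 4 \rightarrow \times 8$ chain) are all hypothetical choices made for the example.

```python
from typing import Callable, List, Tuple

Vector = List[float]
# Hypothetical model signature (an assumption for this sketch):
# (noisy latent z_t, timestep t, LR conditioning) -> prediction
Model = Callable[[Vector, int, Vector], Vector]


def distillation_schedule(target_scale: int) -> List[Tuple[int, int]]:
    """Return (teacher_scale, student_scale) pairs, starting at x2 and
    doubling until the target scale is reached (assumed schedule)."""
    if target_scale < 2 or target_scale & (target_scale - 1):
        raise ValueError("target_scale must be a power of two >= 2")
    pairs = []
    scale = 2
    while scale < target_scale:
        pairs.append((scale, scale * 2))
        scale *= 2
    return pairs


def mse(a: Vector, b: Vector) -> float:
    """Mean squared error between two equally sized vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def student_loss(student: Model, teacher: Model, z_t: Vector, t: int,
                 lr_student: Vector, lr_teacher: Vector) -> float:
    """Loss for one training sample: the frozen teacher, conditioned on the
    less degraded input, supplies the target for the student, which sees
    the more degraded input. Both receive the same noisy latent z_t."""
    target = teacher(z_t, t, lr_teacher)      # teacher is frozen (no update)
    prediction = student(z_t, t, lr_student)  # student is being trained
    return mse(prediction, target)


if __name__ == "__main__":
    print(distillation_schedule(8))  # teacher/student pairs for a x8 model
```

Because the teacher's prediction depends on `z_t` and `t`, the target changes with the noise level, which is the first of the two benefits the abstract attributes to scale distillation.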

Paper Structure (21 sections, 3 equations, 62 figures, 4 tables, 1 algorithm)

Figures (62)

  • Figure 1: Qualitative comparison for $\times 4$ and $\times 8$ magnifications. Each column shows, from top to bottom, the LR input image, 1-step and 200-step SD-SR, and 1-step YONOS-SR (ours). SD-SR represents the standard Stable Diffusion-based SR model, whereas YONOS-SR is our method trained using the same data and parameterization. The 1-step SD-SR output lacks detailed textures compared to 200 steps of the same model; see the building texture in the first column and the hair in the middle column. In contrast, our proposed method outperforms 200-step SD-SR with only one step, particularly for $\times 8$ magnification, where SD-SR fails to recover the details even with 200 steps. Samples are taken from the DIV2K bicubic validation set. The images are best viewed on a display and zoomed in.
  • Figure 2: Training pipeline of the proposed scale distillation. For a given HR image (e.g. size $512\times 512$), shown in green, we generate two degraded versions with factors of $2/N$ and $1/N$ (e.g. sizes $256\times 256$ and $128\times 128$), shown in yellow and red respectively. Both degraded images are resized back via bicubic upsampling to $512 \times 512$ to be used as input to the encoder, which projects them to $4\times 64 \times 64$ tensors. The less and more degraded LR images are used as input to the teacher and student, respectively, via concatenation with the noisy version of the HR image, i.e., $\mathbf{z}_t$. The teacher's output is used as the target for training the student. Note that the teacher is first trained independently for a smaller magnification scale and then frozen during student training.
  • Figure 3: FID vs. number of DDIM steps on the DIV2K validation set obtained through bicubic degradation for $\times 4$ and $\times 8$ magnifications. We use $\times 2 \rightarrow \times 4$ scale distillation for $\times 4$ and $\times 2 \rightarrow \times 4 \rightarrow \times 8$ for $\times 8$ magnification, and compare with standard training directly for $\times 4$ and $\times 8$ respectively. All results are obtained using the original SD decoder. The model trained with scale distillation outperforms standard training by a large margin when using fewer steps for $\times 4$. The gap between scale distillation and standard training is significantly larger for $\times 8$ and remains steady for large numbers of steps as well.
  • Figure 4: Qualitative comparison on the validation set of the DIV2K bicubic degradation dataset: (a) 200-step StableSR, (b) 200-step standard SD-SR, (c) 1-step YONOS-SR (ours), (d) the ground truth. SD-SR represents the standard Stable Diffusion-based SR model. 200-step StableSR and SD-SR tend to over-sharpen, adding artifacts that do not match the ground truth content. Our SR images match the corresponding ground truth images most closely; see the faces, Pepsi, and crocodile textures in the first, second, and third rows, respectively. The images are best viewed on a display and zoomed in.
  • Figure 5: Qualitative comparison on the validation set of the DIV2K bicubic degradation dataset for $\times 8$ magnification when the model is trained directly for $\times 8$ magnification without scale distillation (top row) and with three iterations of scale distillation $\times 2 \rightarrow \times 4 \rightarrow \times 8$ (bottom row). We show the input LR image, the corresponding HR image, and results with 1, 4, and 64 steps using the original decoder for both models. The model trained with scale distillation outperforms standard training by a large margin. Specifically, due to the poor LR input, standard training fails to recover the relevant content. The images are best viewed on a display and zoomed in.
  • ...and 57 more figures
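
The image and latent sizes quoted in the Figure 2 caption follow from simple arithmetic, which the short sketch below reproduces. It is an illustration only: the function name `pipeline_shapes` and its parameters are hypothetical, and the encoder downsampling factor of 8 is inferred from the caption's $512 \times 512 \rightarrow 4\times 64 \times 64$ example rather than stated explicitly there.

```python
from typing import Dict, Tuple


def pipeline_shapes(hr_size: int = 512, scale: int = 4,
                    encoder_downsample: int = 8,
                    latent_channels: int = 4) -> Dict[str, Tuple[int, ...]]:
    """Compute the sizes used in the Figure 2 pipeline for a square HR image.

    scale is the student's magnification N; the teacher works at N/2.
    encoder_downsample=8 is an assumption inferred from 512 -> 64.
    """
    if scale < 2 or scale % 2:
        raise ValueError("scale must be an even integer >= 2")
    if hr_size % scale or hr_size % encoder_downsample:
        raise ValueError("hr_size must be divisible by scale and encoder_downsample")
    teacher_lr = 2 * hr_size // scale   # degradation factor 2/N (less degraded)
    student_lr = hr_size // scale       # degradation factor 1/N (more degraded)
    latent_side = hr_size // encoder_downsample
    return {
        "teacher_lr": (teacher_lr, teacher_lr),
        "student_lr": (student_lr, student_lr),
        # Both LR images are bicubically upsampled back to the HR size
        # before being passed to the encoder.
        "encoder_input": (hr_size, hr_size),
        "latent": (latent_channels, latent_side, latent_side),
    }


if __name__ == "__main__":
    for name, shape in pipeline_shapes(512, 4).items():
        print(name, shape)
```

With the caption's example values (`hr_size=512`, `scale=4`), the function returns a $256\times 256$ teacher input, a $128\times 128$ student input, and $4\times 64\times 64$ latents.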