Improving the Stability and Efficiency of Diffusion Models for Content Consistent Super-Resolution

Lingchen Sun, Rongyuan Wu, Jie Liang, Zhengqiang Zhang, Hongwei Yong, Lei Zhang

TL;DR

This work tackles the instability and limited control of diffusion-model-based super-resolution by introducing Content Consistent Super-Resolution (CCSR), a two-stage framework that separates structure generation from detail refinement. The first stage uses non-uniform timestep sampling in a diffusion process to extract coherent image structures from a low-resolution input, while the second stage fine-tunes a VAE decoder via adversarial training to deterministically enhance high-frequency details, enabling single-step or multi-step diffusion during inference. The authors show that CCSR improves content fidelity and perceptual quality and dramatically reduces stochastic variation across runs, as evidenced by new stability metrics and comprehensive experiments against both standard and efficient DM-based SR methods. The approach delivers a flexible, efficient SR solution that remains robust under real-world degradations and varying perception-fidelity requirements.

Abstract

The generative priors of pre-trained latent diffusion models (DMs) have demonstrated great potential to enhance the visual quality of image super-resolution (SR) results. However, the noise sampling process in DMs introduces randomness in the SR outputs, and the generated contents can differ substantially across noise samples. The multi-step diffusion process can be accelerated by distillation methods, but the generative capacity is then difficult to control. To address these issues, we analyze the respective advantages of DMs and generative adversarial networks (GANs) and propose to partition the generative SR process into two stages, where the DM is employed for reconstructing image structures and the GAN is employed for improving fine-grained details. Specifically, we propose a non-uniform timestep sampling strategy in the first stage. A single timestep sampling is first applied to extract the coarse information from the input image, then a few reverse steps are used to reconstruct the main structures. In the second stage, we finetune the decoder of the pre-trained variational auto-encoder by adversarial GAN training for deterministic detail enhancement. Once trained, our proposed method, namely content consistent super-resolution (CCSR), allows flexible use of different diffusion steps in the inference stage without re-training. Extensive experiments show that with 2 or even 1 diffusion step, CCSR can significantly improve the content consistency of SR outputs while keeping high perceptual quality. Code and models can be found at https://github.com/csslc/CCSR.
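
The non-uniform timestep sampling described in the abstract can be illustrated with a small schedule builder. This is only a sketch under stated assumptions, not the authors' implementation: the function name `non_uniform_timesteps`, the parameters `t_max` and `t_min`, and the default values (1000 training timesteps, 667 and 333 as the bounds) are hypothetical and chosen purely for illustration. The idea shown is one large jump from the noisiest timestep to an intermediate timestep (coarse information extraction), followed by a few uniformly spaced reverse steps that stop early (structure reconstruction).

```python
def non_uniform_timesteps(num_train_steps=1000, t_max=667, t_min=333, num_steps=3):
    """Build a descending list of timesteps for a truncated, non-uniform schedule.

    All names and defaults are illustrative assumptions, not values from the paper.
    The returned list has num_steps + 1 entries: the starting timestep followed by
    the timestep reached after each of the num_steps reverse steps.
    """
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    if not 0 <= t_min <= t_max < num_train_steps:
        raise ValueError("expected 0 <= t_min <= t_max < num_train_steps")

    start = num_train_steps - 1  # noisiest timestep

    if num_steps == 1:
        # Single-step case: one jump from the start straight to t_max.
        landing = [t_max]
    else:
        # First step jumps to t_max; the remaining steps are spaced
        # uniformly between t_max and t_min, where diffusion is truncated.
        stride = (t_max - t_min) / (num_steps - 1)
        landing = [round(t_max - i * stride) for i in range(num_steps)]

    return [start] + landing


if __name__ == "__main__":
    print(non_uniform_timesteps(num_steps=3))
    print(non_uniform_timesteps(num_steps=1))
```

In this sketch the output at the final timestep would be handed to the finetuned VAE decoder rather than denoised further, which is how the two stages connect in the description above.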

Paper Structure (17 sections, 5 equations, 7 figures, 7 tables)

This paper contains 17 sections, 5 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: Visual comparisons between the super-resolution outputs produced by different DM-based methods from the same low-quality input image with two different noise samples. $S$ denotes the number of diffusion sampling timesteps. Please zoom in for a better view. Existing DM-based methods, including StableSR, PASD, SeeSR, SUPIR and AddSR, show noticeable instability across the different noise samples. OSEDiff takes the low-quality image directly as input without noise sampling. It is deterministic and stable, but cannot perform multi-step diffusion for higher generative capacity. In contrast, our proposed CCSR method supports both multi-step and single-step diffusion, while producing stable results with high fidelity and visual quality.
  • Figure 2: Left: PSNR and LPIPS indices of SR outputs by SwinIR-$\ell_1$, SwinIR-GAN and StableSR at different steps on the DIV2K dataset. Right: Visual comparisons of the SR results on three LR images of different quality levels. Please refer to the motivation section for detailed explanations of this figure.
  • Figure 3: Framework of our proposed CCSR. There are two stages in CCSR, structure refinement (top left) and detail enhancement (top right). In the first stage, a non-uniform sampling strategy (bottom) is proposed, which applies one timestep for information extraction from LR and several other timesteps for image structure generation. The diffusion process is then stopped and the truncated output is fed into the second stage, where the detail is enhanced by finetuning the VAE decoder with adversarial training.
  • Figure 4: Visual comparisons of CCSR and its variants 'V1' and 'V2'. One can see that the non-uniform timestep sampling (NUTS) and decoder finetuning (DeFT) strategies improve the super-resolution performance and stability.
  • Figure 5: Visual comparisons (best viewed zoomed in on screen) between CCSR and state-of-the-art GAN-based and standard DM-based SR methods. For each of the DM-based methods, the two restored images with the best and worst PSNR values over 10 runs are shown for a more comprehensive and fair comparison. Our proposed CCSR best reconstructs accurate structures and produces realistic, content-consistent and stable details.
  • ...and 2 more figures
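
The best/worst selection mentioned in the Figure 5 caption (picking, among several runs of a stochastic method, the outputs with the highest and lowest PSNR) can be sketched as follows. This is a minimal illustration, not the paper's evaluation code or its stability metrics: the helper names `psnr` and `best_and_worst_psnr` are hypothetical, images are assumed to be NumPy arrays scaled to [0, 1], and PSNR is the standard definition $10\log_{10}(\text{max}^2/\text{MSE})$.

```python
import numpy as np


def psnr(reference, estimate, max_value=1.0):
    """Standard PSNR in dB between two images with the same shape."""
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))


def best_and_worst_psnr(reference, runs, max_value=1.0):
    """Return (best_index, worst_index, scores) over repeated SR outputs.

    `runs` is a sequence of outputs for the same input, e.g. one per
    noise sample. The gap between the best and worst score gives a rough
    picture of how much a method varies across runs.
    """
    scores = [psnr(reference, run, max_value) for run in runs]
    best = int(np.argmax(scores))
    worst = int(np.argmin(scores))
    return best, worst, scores


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ground_truth = rng.random((32, 32, 3))
    # Simulated outputs: ground truth plus noise of varying strength.
    outputs = [np.clip(ground_truth + rng.normal(0, s, ground_truth.shape), 0, 1)
               for s in (0.02, 0.05, 0.10)]
    best, worst, scores = best_and_worst_psnr(ground_truth, outputs)
    print(best, worst, [round(s, 2) for s in scores])
```

A deterministic method would return identical scores for every run, so the best and worst outputs would coincide.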