Pixel-level and Semantic-level Adjustable Super-resolution: A Dual-LoRA Approach

Lingchen Sun, Rongyuan Wu, Zhiyuan Ma, Shuaizheng Liu, Qiaosi Yi, Lei Zhang

TL;DR

PiSA-SR tackles the entangled pixel-level and semantic-level objectives in real-world SR by introducing two LoRA adapters on a pre-trained diffusion model, enabling residual learning in latent space with $z_H = z_L - \lambda \epsilon_\theta(z_L)$. It decouples optimization into pixel- and semantic-level components, using $\ell_2$ loss for pixel fidelity and LPIPS plus classifier score distillation (CSD) losses for semantic refinement, then enables inference-time adjustment via $\lambda_{pix}$ and $\lambda_{sem}$ to tailor results without retraining. The approach delivers high-quality, efficient one-step diffusion SR and demonstrates favorable trade-offs across PSNR/LPIPS and no-reference metrics on synthetic and real-world data, with practical adjustability for user preferences. This work offers a scalable, flexible pathway for real-world SR applications where fidelity and perceptual quality must be balanced on demand.

Abstract

Diffusion prior-based methods have shown impressive results in real-world image super-resolution (SR). However, most existing methods entangle pixel-level and semantic-level SR objectives in the training process and therefore struggle to balance pixel-wise fidelity and perceptual quality. Meanwhile, users have varying preferences on SR results, so there is a need for an adjustable SR model that can be tailored to different fidelity-perception preferences during inference without re-training. We present Pixel-level and Semantic-level Adjustable SR (PiSA-SR), which learns two LoRA modules upon the pre-trained Stable Diffusion (SD) model to achieve improved and adjustable SR results. We first formulate the SD-based SR problem as learning the residual between the low-quality input and the high-quality output, then show that the learning objective can be decoupled into two distinct LoRA weight spaces: one is characterized by the $\ell_2$-loss for pixel-level regression, and the other is characterized by the LPIPS and classifier score distillation losses to extract semantic information from pre-trained classification and SD models. In its default setting, PiSA-SR can be performed in a single diffusion step, achieving leading real-world SR results in both quality and efficiency. By introducing two adjustable guidance scales on the two LoRA modules to control the strengths of pixel-wise fidelity and semantic-level details during inference, PiSA-SR can offer flexible SR results according to user preference without re-training. Code and models can be found at https://github.com/csslc/PiSA-SR.

Paper Structure

This paper contains 12 sections, 8 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Visual illustration of our pixel- and semantic-level adjustable method for real-world SR. By increasing the pixel-level guidance scale $\lambda_{pix}$, the image degradations such as noise and compression artifacts can be gradually removed; however, a too-strong $\lambda_{pix}$ will make the SR image over-smoothed. By increasing the semantic-level guidance scale $\lambda_{sem}$, the SR images will have more semantic details; nonetheless, a too-high $\lambda_{sem}$ will generate visual artifacts. Please zoom in for a better view.
  • Figure 2: Comparison of the pipeline of different DM-based SR methods. (a) Multi-step methods (StableSR, DiffBIR, PASD, SeeSR, XPSR, DreamClear) perform $T$ denoising steps starting from Gaussian noise $z_T$, conditioned on the LQ image $x_L$. (b) OSEDiff starts from LQ latent representation $z_L$ with only one-step diffusion. (c) Our proposed PiSA-SR formulates the SD-based SR as learning the residual between the LQ latent $z_L$ and HQ latent $z_H$.
  • Figure 3: The (a) training and (b) inference procedures of PiSA-SR. During the training process, two LoRA modules are respectively optimized for pixel-level and semantic-level enhancement. During the inference stage, users can use the default setting to reconstruct the HQ image in one-step diffusion or adjust $\lambda_{pix}$ and $\lambda_{sem}$ to control the strengths of pixel-level and semantic-level enhancement.
  • Figure 4: The model outputs with pixel-wise and semantic-level losses for a given LQ image.
  • Figure 5: Visual comparisons of different DM-based SR methods. Please zoom in for a better view.
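
The inference-time adjustment described in the captions of Figures 1 and 3 can be illustrated with a small sketch. It follows the residual form $z_H = z_L - \lambda \epsilon_\theta(z_L)$ from the TL;DR, but the rule used to combine the two guidance scales is an assumption made for illustration: this page does not state the exact formula, so the semantic term is taken here as the difference between a full (pixel plus semantic) prediction and a pixel-only prediction. The function names and the toy predictors are hypothetical stand-ins for the LoRA-adapted network and are not taken from the released code.

```python
from typing import Callable, List

Latent = List[float]
Predictor = Callable[[Latent], Latent]


def adjustable_residual_step(
    z_l: Latent,
    pixel_residual: Predictor,
    full_residual: Predictor,
    lambda_pix: float = 1.0,
    lambda_sem: float = 1.0,
) -> Latent:
    """One-step residual update with two guidance scales.

    ASSUMPTION: the combination rule below is illustrative only.
    The pixel term is the pixel-only prediction; the semantic term is
    the difference between the full and the pixel-only predictions.
    """
    r_pix = pixel_residual(z_l)
    r_full = full_residual(z_l)
    z_h = []
    for z, rp, rf in zip(z_l, r_pix, r_full):
        residual = lambda_pix * rp + lambda_sem * (rf - rp)
        z_h.append(z - residual)  # z_H = z_L - (scaled residual)
    return z_h


# Hypothetical toy predictors standing in for the adapted network.
def toy_pixel_residual(z: Latent) -> Latent:
    return [0.25 * v for v in z]


def toy_full_residual(z: Latent) -> Latent:
    return [0.5 * v for v in z]


if __name__ == "__main__":
    z_l = [1.0, 2.0]
    # Default setting: both scales equal to 1.
    print(adjustable_residual_step(z_l, toy_pixel_residual, toy_full_residual))
    # Pixel-level enhancement only: semantic scale set to 0.
    print(adjustable_residual_step(z_l, toy_pixel_residual, toy_full_residual, 1.0, 0.0))
```

Under this assumed rule, setting both scales to 1 reduces to the full prediction, setting `lambda_sem` to 0 keeps only the pixel-level residual, and setting both to 0 returns the LQ latent unchanged. The actual combination used by PiSA-SR should be checked against the paper and repository.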