ICM-SR: Image-Conditioned Manifold Regularization for Image Super-Resolution

Junoh Kang, Donghun Ryou, Bohyung Han

TL;DR

Real-ISR methods that use diffusion priors suffer from a misaligned target manifold when that manifold is conditioned on text prompts. The authors propose Image-Conditioned Manifold (ICM) regularization, which anchors the SR output to a manifold conditioned on sparse structural cues (a low-res colormap and Canny edges) via a T2I-Adapter, improving training stability and perceptual quality. The framework trains a one-step SR model with a reconstruction loss and ICM regularization while learning an auxiliary diffusion model to capture the generator’s output distribution. Experiments on DIV2K-Val, RealSR, and DRealSR show state-of-the-art perceptual performance among one-step methods, with ablations validating the structural conditioning design. Because inference takes a single step, the method is a practical alternative to multi-step Real-ISR approaches; the authors state that they will release the source code for reproducibility.

Abstract

Real-world image super-resolution (Real-ISR) often leverages the powerful generative priors of text-to-image diffusion models by regularizing the output to lie on their learned manifold. However, existing methods often overlook the importance of the regularizing manifold, typically defaulting to a text-conditioned manifold. This approach suffers from two key limitations. Conceptually, it is misaligned with the Real-ISR task, which is to generate high-quality (HQ) images directly tied to the low-quality (LQ) images. Practically, the teacher model often reconstructs images with color distortions and blurred edges, indicating a flawed generative prior for this task. To correct these flaws and ensure conceptual alignment, a more suitable manifold must incorporate information from the images. While the most straightforward approach is to condition directly on the raw input images, their high information density makes the regularization process numerically unstable. To resolve this, we propose image-conditioned manifold regularization (ICM), a method that regularizes the output towards a manifold conditioned on sparse yet essential structural information: a combination of a colormap and Canny edges. ICM provides a task-aligned and stable regularization signal, thereby avoiding the instability of dense conditioning and enhancing the final super-resolution quality. Our experiments confirm that the proposed regularization significantly enhances super-resolution performance, particularly in perceptual quality, demonstrating its effectiveness for real-world applications. We will release the source code of our work for reproducibility.
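To make the "sparse yet essential structural information" concrete, the following is a minimal sketch of how a colormap and an edge map might be extracted from an image and stacked into one condition tensor. It is illustrative only and not the authors' pipeline: the function names, the block size, and the threshold are hypothetical, the colormap is assumed to be a block-averaged image, and a thresholded gradient magnitude stands in for a real Canny detector.

```python
import numpy as np

def lowres_colormap(img: np.ndarray, block: int = 8) -> np.ndarray:
    """Average each block x block tile, then repeat it back to full size.

    Assumption: the colormap is a coarse, downsampled copy of the image.
    """
    h, w, c = img.shape
    assert h % block == 0 and w % block == 0, "sketch assumes divisible sizes"
    tiles = img.reshape(h // block, block, w // block, block, c).mean(axis=(1, 3))
    return np.repeat(np.repeat(tiles, block, axis=0), block, axis=1)

def edge_map(img: np.ndarray, thresh: float = 0.2) -> np.ndarray:
    """Binary edges from thresholded gradient magnitude.

    This is a simple stand-in, NOT the Canny algorithm (no smoothing,
    non-maximum suppression, or hysteresis).
    """
    gray = img.mean(axis=2)
    gy, gx = np.gradient(gray)  # derivatives along rows, then columns
    return (np.hypot(gx, gy) > thresh).astype(np.float32)

def structural_condition(img: np.ndarray, block: int = 8,
                         thresh: float = 0.2) -> np.ndarray:
    """Stack colormap (3 channels) and edges (1 channel) into H x W x 4."""
    cmap = lowres_colormap(img, block)
    edges = edge_map(img, thresh)[..., None]
    return np.concatenate([cmap, edges], axis=2)

# Toy image: left half black, right half white.
img = np.zeros((16, 16, 3), dtype=np.float32)
img[:, 8:] = 1.0
cond = structural_condition(img)
print(cond.shape)
```

On this toy input the colormap reproduces the two flat halves and the edge channel fires only along the vertical boundary, which illustrates why such a condition is far sparser than the raw image.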

Paper Structure

This paper contains 32 sections, 1 theorem, 13 equations, 7 figures, 5 tables, 1 algorithm.

Key Result

Lemma 1

Let $\mathbf{c}$ be a strong condition such that the latent variable $\mathbf{z}_0$ is deterministic, i.e., $\mathbf{z}_0|\mathbf{c} = \mu(\mathbf{c})$. If the perturbed distribution $q_t(\mathbf{z}_t|\mathbf{c})$ is generated by $\mathbf{z}_t = a_t \mathbf{z}_0 + b_t \epsilon$ with $\epsilon \sim \dots$ [...] Consequently, the gradient of the VSD loss $\nabla_\theta \mathcal{L}_\text{VSD}$ degenerates to [...]
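The lemma statement is cut off in this listing, so its exact conclusion is not reproduced here. As a hedged sketch of what its stated premises imply, assume the noise is standard Gaussian, $\epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ (an assumption, since the distribution is truncated in the text). The perturbed distribution is then a single Gaussian whose score has a closed form:

```latex
\begin{aligned}
\mathbf{z}_t &= a_t\,\mu(\mathbf{c}) + b_t\,\epsilon,
  \qquad \epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I}) \quad \text{(assumed)} \\
q_t(\mathbf{z}_t \mid \mathbf{c}) &=
  \mathcal{N}\!\left(\mathbf{z}_t;\; a_t\,\mu(\mathbf{c}),\; b_t^2\,\mathbf{I}\right) \\
\nabla_{\mathbf{z}_t} \log q_t(\mathbf{z}_t \mid \mathbf{c}) &=
  -\,\frac{\mathbf{z}_t - a_t\,\mu(\mathbf{c})}{b_t^2}
\end{aligned}
```

Under these assumptions the score simply points from $\mathbf{z}_t$ toward the single point $a_t\,\mu(\mathbf{c})$, so the conditional distribution carries no variability beyond the injected noise. This is a sketch of the setup only, not the paper's proof.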

Figures (7)

  • Figure 1: Performance comparison on the DRealSR benchmark [wei2020component]. The red and blue metrics are no-reference and reference perceptual metrics, respectively. ICM-SR (ours) stands out for perceptual metrics, highlighting its strong performance in practical scenarios.
  • Figure 2: Visualization of denoised latents from the teacher diffusion model. We add noise corresponding to timestep $t$ to the ground-truth latent and visualize the model's single-step denoised prediction. (Text-cond.) The standard text-conditioned prior struggles to reconstruct the image from noisy latents, especially at large $t$. It produces oversaturated colors (left) and fails to recover edges (right). (Image-cond.) In contrast, our proposed image-conditioned prior, guided by a colormap and Canny edges, provides a much more accurate prediction.
  • Figure 3: Training framework of ICM-SR. Our framework trains a one-step super-resolution generator using two main losses. A reconstruction loss $\mathcal{L}_\text{Rec}$ ensures fidelity to the ground-truth $\mathbf{x}_H$. For realism, a VSD-based regularization loss $\mathcal{L}_\text{Reg}$ is applied, which involves a frozen pre-trained diffusion model $\epsilon_\phi$ and a trainable auxiliary model $\epsilon_\psi$. The key innovation of our method is to condition the target manifold on structural information $\mathbf{F}_c$ (e.g., edges, resized image) from the HQ image $\mathbf{x}_H$. These conditions are encoded by T2I-Adapter $A_\eta$ and then injected into both $\epsilon_\phi$ and $\epsilon_\psi$ to guide the generator $G_\theta$ towards producing outputs that are not only realistic but also structurally aligned with the target image.
  • Figure : Qualitative results of our method compared to OSEDiff and TSD-SR on the DIV2K validation dataset. Our method demonstrates superior performance in recovering fine details. Zoom in for better visualization.
  • Figure : Qualitative comparison of our method with various multi-step and one-step diffusion-based methods. ‘s’ denotes the number of network inferences in the method. Zoom in for better visualization.
  • ...and 2 more figures
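The captions of Figures 2 and 3 describe two operations that can be sketched numerically: noising a latent and forming a single-step denoised prediction, and comparing the frozen and auxiliary noise predictions in a VSD-style regularizer. The snippet below is a minimal illustration under stated assumptions: it uses the $a_t$, $b_t$ notation of Lemma 1, assumes an $\epsilon$-prediction parameterization, and uses random arrays in place of real network outputs. The function names are hypothetical and not taken from the authors' code.

```python
import numpy as np

def add_noise(z0, eps, a_t, b_t):
    """Forward perturbation: z_t = a_t * z_0 + b_t * eps."""
    return a_t * z0 + b_t * eps

def predict_z0(z_t, eps_hat, a_t, b_t):
    """Single-step denoised estimate from a predicted noise eps_hat.

    Assumption: the model predicts noise (epsilon parameterization).
    """
    return (z_t - b_t * eps_hat) / a_t

def vsd_direction(eps_teacher, eps_aux, weight=1.0):
    """Generic VSD-style update direction on the latent:
    weight * (eps_teacher - eps_aux). Both predictions would be
    conditioned on the same structural features in the setup described.
    """
    return weight * (eps_teacher - eps_aux)

rng = np.random.default_rng(0)
z0 = rng.standard_normal((4, 8, 8))   # stand-in for a ground-truth latent
eps = rng.standard_normal(z0.shape)   # injected noise
a_t, b_t = 0.6, 0.8                   # hypothetical schedule values

z_t = add_noise(z0, eps, a_t, b_t)
z0_perfect = predict_z0(z_t, eps, a_t, b_t)       # exact noise recovers z0
z0_biased = predict_z0(z_t, eps + 0.1, a_t, b_t)  # biased noise estimate
print(np.abs(z0_biased - z0).max())
```

A constant error of 0.1 in the predicted noise shifts the recovered latent by $b_t \cdot 0.1 / a_t$, so the same prediction error is amplified more as $b_t / a_t$ grows. That is one plausible reading of why the reconstructions in Figure 2 degrade at large $t$.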

Theorems & Definitions (2)

  • Lemma 1
  • proof