HoliSDiP: Image Super-Resolution via Holistic Semantics and Diffusion Prior

Li-Yuan Tsao, Hao-Wei Chen, Hao-Wei Chung, Deqing Sun, Chun-Yi Lee, Kelvin C. K. Chan, Ming-Hsuan Yang

TL;DR

HoliSDiP tackles Real-ISR by integrating semantic segmentation into diffusion priors to provide both concise textual prompts and dense spatial guidance. The method introduces Semantic Label-Based Prompting (SLBP), which uses semantic labels as text prompts to reduce prompt noise, and Dense Semantic Guidance (DSG), which comprises segmentation masks $M^s$ and a Segmentation-CLIP Map $M^{sc}$ that are combined by a Guidance Fusion Module (GFM) to steer local details. DSG enables multi-scale, spatially aligned refinement through SAFT-based feature transformations. Empirical results on synthetic and real-world benchmarks show that HoliSDiP achieves leading no-reference perceptual metrics and competitive fidelity, with improved texture detail and semantic consistency in Real-ISR outputs.

Abstract

Text-to-image diffusion models have emerged as powerful priors for real-world image super-resolution (Real-ISR). However, existing methods may produce unintended results due to noisy text prompts and their lack of spatial information. In this paper, we present HoliSDiP, a framework that leverages semantic segmentation to provide both precise textual and spatial guidance for diffusion-based Real-ISR. Our method employs semantic labels as concise text prompts while introducing dense semantic guidance through segmentation masks and our proposed Segmentation-CLIP Map. Extensive experiments demonstrate that HoliSDiP achieves significant improvement in image quality across various Real-ISR scenarios through reduced prompt noise and enhanced spatial control.
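To make the idea of using semantic labels as concise text prompts concrete, the following is a minimal sketch, not the authors' implementation. It assumes the segmentation model yields a 2D map of integer class ids; the `CLASS_NAMES` table, the `min_fraction` area filter, and the comma-separated prompt format are all hypothetical choices made for illustration.

```python
from collections import Counter

# Hypothetical id-to-name table; a real segmentation model ships its own.
CLASS_NAMES = {0: "sky", 1: "building", 2: "tree", 3: "person"}


def labels_to_prompt(label_map, class_names=CLASS_NAMES, min_fraction=0.0):
    """Build a comma-separated prompt from the classes present in a label map.

    label_map is a list of rows of integer class ids. Classes covering less
    than min_fraction of the pixels are dropped (an assumed heuristic).
    """
    counts = Counter(label for row in label_map for label in row)
    total = sum(counts.values())
    # Keep known classes that are large enough, ordered by pixel count.
    kept = [
        class_names[label]
        for label, n in counts.most_common()
        if label in class_names and n / total >= min_fraction
    ]
    return ", ".join(kept)


toy_map = [
    [0, 0, 0, 0],
    [1, 1, 2, 0],
    [1, 1, 2, 2],
]
print(labels_to_prompt(toy_map))  # sky, building, tree
```

Because the prompt contains only class names found in the image, it avoids the free-form descriptions that a tagging or captioning model might add.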

Paper Structure

This paper contains 24 sections, 4 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: The proposed HoliSDiP performs well against the state-of-the-art frameworks SeeSR and DiffBIR by offering precise and multi-scale semantics, guiding the text-to-image diffusion model to synthesize high-quality images with fine details.
  • Figure 2: Comparison between the proposed HoliSDiP and existing Real-ISR methods. (a) Current studies leverage text prompts that include redundant descriptions and lack localized priors. (b) Our HoliSDiP leverages semantic segmentation to offer clear text prompts and dense semantic guidance.
  • Figure 3: Overview of HoliSDiP. The segmentation model first processes the LR image to generate segmentation results, which are used to extract semantic labels, a segmentation mask, and a Segmentation-CLIP Map (SCMap). The semantic labels are employed as text prompts to inject textual guidance through cross-attention layers, while the segmentation mask and SCMap are integrated by our Guidance Fusion Module to facilitate semantic-adaptive feature transformation. Additionally, ControlNet and LR cross-attention layers are utilized to strengthen guidance from the LR image. These conditions are incorporated into the denoising UNet, which iteratively refines the noisy input to produce the final SR image.
  • Figure 4: Qualitative comparison between the proposed HoliSDiP and contemporary Real-ISR methods. HoliSDiP presents sharper details without introducing noticeable visual artifacts across various Real-ISR scenarios.
  • Figure 5: Qualitative comparison between using image tags (as in SeeSR) as prompts (+T) and employing the proposed Semantic Label-Based Prompting (+L) in HoliSDiP. The proposed prompting scheme provides more concise descriptions, reducing visual artifacts.
  • ...and 2 more figures
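
The Figure 3 caption mentions a Segmentation-CLIP Map (SCMap) extracted from the segmentation results. One plausible reading, sketched below under stated assumptions, is a dense map in which each pixel carries the embedding of its class label. The function name `build_scmap` is hypothetical, and the toy 3-dimensional vectors only stand in for CLIP text embeddings; the source does not specify the construction at this level of detail.

```python
def build_scmap(label_map, label_embeddings):
    """Replace each pixel's class id with that class's embedding vector.

    label_map is a list of rows of integer class ids; label_embeddings maps
    a class id to a fixed-length vector. The result has shape
    height x width x embedding_dim as nested lists.
    """
    return [[list(label_embeddings[label]) for label in row] for row in label_map]


# Toy vectors standing in for CLIP text embeddings of the label names (assumed).
toy_embeddings = {
    0: (1.0, 0.0, 0.0),
    1: (0.0, 1.0, 0.0),
    2: (0.0, 0.0, 1.0),
}
toy_mask = [
    [0, 0, 1],
    [2, 1, 1],
]
scmap = build_scmap(toy_mask, toy_embeddings)
print(len(scmap), len(scmap[0]), len(scmap[0][0]))  # height, width, embedding dim
```

A map built this way is spatially aligned with the segmentation mask, which is what allows the two to be fused and applied per location rather than as a single global prompt.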