HoliSDiP: Image Super-Resolution via Holistic Semantics and Diffusion Prior
Li-Yuan Tsao, Hao-Wei Chen, Hao-Wei Chung, Deqing Sun, Chun-Yi Lee, Kelvin C. K. Chan, Ming-Hsuan Yang
TL;DR
HoliSDiP tackles real-world image super-resolution (Real-ISR) by integrating semantic segmentation into diffusion priors, providing both concise textual prompts and dense spatial guidance. The method introduces Semantic Label-Based Prompting (SLBP), which uses semantic labels as low-noise text prompts, and Dense Semantic Guidance (DSG), which consists of segmentation masks $M^s$ and a Segmentation-CLIP Map $M^{sc}$ that steer local details via a Guidance Fusion Module (GFM). SLBP reduces prompt noise, while DSG enables multi-scale, spatially aligned refinement through SAFT-based feature transformations. On synthetic and real-world benchmarks, HoliSDiP achieves leading scores on no-reference perceptual metrics and competitive fidelity, with improved texture detail and semantic consistency in its Real-ISR outputs.
Abstract
Text-to-image diffusion models have emerged as powerful priors for real-world image super-resolution (Real-ISR). However, existing methods may produce unintended results due to noisy text prompts and their lack of spatial information. In this paper, we present HoliSDiP, a framework that leverages semantic segmentation to provide both precise textual and spatial guidance for diffusion-based Real-ISR. Our method employs semantic labels as concise text prompts while introducing dense semantic guidance through segmentation masks and our proposed Segmentation-CLIP Map. Extensive experiments demonstrate that HoliSDiP achieves significant improvement in image quality across various Real-ISR scenarios through reduced prompt noise and enhanced spatial control.
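As a rough illustration of the two kinds of guidance described above, the following Python sketch builds a label-based text prompt from a segmentation mask and a dense per-pixel embedding map in the spirit of the Segmentation-CLIP Map. It is a toy under stated assumptions, not the authors' implementation: the function names, the `id_to_name` and `id_to_embedding` dictionaries, and the short embedding vectors are all hypothetical. A real system would obtain the mask from a segmentation model and the embeddings from a CLIP text encoder.

```python
from typing import Dict, List, Sequence

Mask = Sequence[Sequence[int]]  # H x W grid of integer class IDs


def labels_to_prompt(mask: Mask, id_to_name: Dict[int, str]) -> str:
    """Join the class names present in the mask into one concise prompt.

    Names are listed once each, in order of first appearance (row-major).
    """
    seen: List[str] = []
    for row in mask:
        for class_id in row:
            name = id_to_name[class_id]
            if name not in seen:
                seen.append(name)
    return ", ".join(seen)


def segmentation_embedding_map(
    mask: Mask, id_to_embedding: Dict[int, Sequence[float]]
) -> List[List[List[float]]]:
    """Replace every pixel's class ID with that class's embedding vector.

    The result has shape H x W x D and is spatially aligned with the mask.
    """
    return [[list(id_to_embedding[class_id]) for class_id in row] for row in mask]


# Toy 2 x 3 mask with three classes (hypothetical IDs and names).
toy_mask = [
    [0, 0, 1],
    [2, 1, 1],
]
toy_names = {0: "sky", 1: "tree", 2: "road"}
# Stand-ins for CLIP text embeddings of each class name (assumed, D = 2).
toy_embeddings = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [0.5, 0.5]}

prompt = labels_to_prompt(toy_mask, toy_names)
dense_map = segmentation_embedding_map(toy_mask, toy_embeddings)
print(prompt)  # sky, tree, road
```

The prompt contains only class names that actually occur in the image, which is what keeps it concise, while the dense map carries the same semantics at every pixel location so that guidance can vary spatially.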
