Semantic Segmentation Prior for Diffusion-Based Real-World Super-Resolution

Jiahua Xiao, Jiawei Zhang, Dongqing Zou, Xiaodan Zhang, Jimmy Ren, Xing Wei

TL;DR

This work tackles Real-ISR by addressing semantic mislocalization and ambiguity in diffusion-based restoration. It introduces SegSR, a dual-diffusion framework that couples a diffusion-based SR model (SRDM) with a diffusion-based segmentation model (SegDM) through a Dual-Modality Bridge (DMB), enabling mutual refinement of image content and segmentation priors during reverse diffusion. By leveraging pixel-level segmentation labels as priors, SegSR improves semantic fidelity while maintaining perceptual realism, outperforming several state-of-the-art methods on synthetic and real-world benchmarks, particularly in non-reference quality metrics. The approach demonstrates that integrating segmentation priors into generative restoration can enhance both semantic accuracy and visual quality, with practical impact for real-world image enhancement tasks.

Abstract

Real-world image super-resolution (Real-ISR) has achieved a remarkable leap by leveraging large-scale text-to-image models, enabling realistic image restoration from given recognition textual prompts. However, these methods sometimes fail to recognize some salient objects, resulting in inaccurate semantic restoration in these regions. Additionally, the same region may respond strongly to more than one prompt, which leads to semantic ambiguity in image super-resolution. To alleviate these two issues, we propose using semantic segmentation as an additional control condition for diffusion-based image super-resolution. Compared to textual prompt conditions, semantic segmentation enables a more comprehensive perception of salient objects within an image by assigning a class label to each pixel. It also mitigates the risk of semantic ambiguity by explicitly allocating objects to their respective spatial regions. In practice, inspired by the fact that image super-resolution and segmentation can benefit each other, we propose SegSR, which introduces a dual-diffusion framework to facilitate interaction between the image super-resolution and segmentation diffusion models. Specifically, we develop a Dual-Modality Bridge module to enable updated information flow between these two diffusion models, achieving mutual benefit during the reverse diffusion process. Extensive experiments show that SegSR can generate realistic images while preserving semantic structures more effectively.
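
The alternating information flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the real SRDM and SegDM are neural diffusion models, whereas every function below (`dual_modality_bridge`, `srdm_step`, `segdm_step`, `reverse_diffusion`) is a hypothetical stand-in that works on plain lists of floats. The only thing it is meant to show is the control flow, assuming that at each reverse step the bridge turns the previous outputs of both branches into conditions for the other branch.

```python
import random


def dual_modality_bridge(z_prev, s_prev):
    """Hypothetical stand-in for the Dual-Modality Bridge.

    Takes the previous-step outputs of both branches and returns
    (image condition for the segmentation branch,
     segmentation condition for the super-resolution branch).
    Here the "encoding" is just a copy; the real module is learned.
    """
    return list(z_prev), list(s_prev)


def srdm_step(z_t, z_lq, seg_cond, rate=0.5):
    """Toy SR update: pull the latent toward a mix of the LQ embedding
    and the segmentation condition (assumption, not the real denoiser)."""
    return [z + rate * (0.5 * (lq + c) - z)
            for z, lq, c in zip(z_t, z_lq, seg_cond)]


def segdm_step(s_t, f_lq, image_cond, rate=0.5):
    """Toy segmentation update: pull the mask latent toward a mix of the
    LQ features and the image condition (assumption, not the real denoiser)."""
    return [s + rate * (0.5 * (f + c) - s)
            for s, f, c in zip(s_t, f_lq, image_cond)]


def reverse_diffusion(z_lq, f_lq, num_steps=10, seed=0):
    """Run both toy branches side by side, exchanging conditions each step."""
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0) for _ in z_lq]  # SR branch starts from noise
    s = [rng.gauss(0.0, 1.0) for _ in f_lq]  # segmentation branch starts from noise
    for _ in range(num_steps):
        # Conditions for this step come from the previous step of both branches.
        image_cond, seg_cond = dual_modality_bridge(z, s)
        z_next = srdm_step(z, z_lq, seg_cond)
        s_next = segdm_step(s, f_lq, image_cond)
        z, s = z_next, s_next
    return z, s


if __name__ == "__main__":
    z_final, s_final = reverse_diffusion([1.0] * 4, [1.0] * 4, num_steps=50)
    print(z_final, s_final)
```

In this toy setup both branches start from independent noise and are updated from the same previous-step state, so neither branch sees the other's current-step output. Whether the actual model updates the branches in exactly this order is an assumption of the sketch.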

Paper Structure

This paper contains 11 sections, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Comparison of Real-ISR results between SegSR, conditioned on segmentation masks, and prompt-guided methods (exemplified by SeeSR seesr). (a) SeeSR fails to recognize some salient components in the image. The cross-attention maps show attention weights being allocated to other objects in the region, leading to inaccurate generation of semantic details. (b) The same region has a strong response to more than one prompt through cross-attention, which leads to semantically ambiguous restoration results. In the cross-attention map visualizations, warmer colors indicate higher attention weights, while cooler colors represent lower attention weights. In contrast, SegSR can restore more faithful details as long as the estimated segmentation mask is accurate.
  • Figure 2: Mutual Refinement within SegSR. We present the final result predictions at different steps $t$ of the inverse diffusion process for both SRDM and SegDM. (a) To provide semantic segmentation priors for Real-ISR, the pretrained Segformer segformer predicts segmentation masks from degraded images, but these predictions become inaccurate when the degradation is severe. As a result, the Segformer-guided SRDM struggles to restore images with high semantic fidelity. (b) The proposed SRDM and SegDM mutually benefit from each other through the DMB in SegSR, progressively improving segmentation predictions and image quality through the inverse diffusion process.
  • Figure 3: Overview of SegSR. The framework comprises three key parts: (i) SRDM performs the super-resolution diffusion process, conditioned on the LQ image embedding $Z_{lq}$ and the gradually updated segmentation prior $S_t$ from SegDM, to generate a highly realistic image; (ii) SegDM conducts the semantic segmentation diffusion process, conditioned on the LQ image features $F_{lq}$ and the iteratively restored image information $Z_t$ from SRDM, to improve the accuracy of the segmentation priors; (iii) the DMB module encodes the intermediate updated features $Z_{t-1}$ and $S_{t-1}$ from SRDM and SegDM at the previous step, producing the image and segmentation conditions $I_{t}$ and $C_{t}$ for the current time step. SRDM and SegDM collaborate through the DMB module to ultimately achieve realistic image super-resolution.
  • Figure 4: Qualitative comparisons on synthetic benchmarks: DIV2K-Val div2k (top) and OST-Val sftgan (bottom). Please zoom in for details.
  • Figure 5: Qualitative comparisons on real-world benchmarks: RealSR realsr (top) and RealLQ250 dreamclear (bottom). Please zoom in for details.
  • ...and 1 more figure