Enhanced Semantic Extraction and Guidance for UGC Image Super Resolution

Yiwen Wang, Ying Liang, Yuxuan Zhang, Xinning Chai, Zhengxue Cheng, Yingsheng Qin, Yucai Yang, Rong Xie, Li Song

TL;DR

This work addresses the gap between real-world UGC degradations and synthetic degradations in single-image super-resolution by embedding semantic guidance into a diffusion-based framework. It builds a more realistic training regime by simulating degradations on the LSDIR dataset and combining the result with the synthetic UGC training data, and it uses SAM2 for high-level semantic conditioning alongside ControlNet to preserve structure. The semantic-aware module, integrated into the diffusion denoising process, improves both perceptual fidelity and semantic coherence, as shown by quantitative and qualitative experiments, including strong performance on wild UGC data and competitive results on synthetic data and DIV2K. The approach narrows the domain gap between synthetic and real-world degradations for practical UGC image enhancement, with room for further improvement in text regions and artifact control.

Abstract

Due to the disparity between real-world degradations in user-generated content (UGC) images and synthetic degradations, traditional super-resolution methods struggle to generalize effectively, necessitating a more robust approach to model real-world distortions. In this paper, we propose a novel approach to UGC image super-resolution by integrating semantic guidance into a diffusion framework. Our method addresses the inconsistency between degradations in wild and synthetic datasets by separately simulating the degradation processes on the LSDIR dataset and combining them with the official paired training set. Furthermore, we enhance degradation removal and detail generation by incorporating a pretrained semantic extraction model (SAM2) and fine-tuning key hyperparameters for improved perceptual fidelity. Extensive experiments demonstrate the superiority of our approach against state-of-the-art methods. Additionally, the proposed model won second place in the CVPR NTIRE 2025 Short-form UGC Image Super-Resolution Challenge, further validating its effectiveness. The code is available at https://github.com/Moonsofang/NTIRE-2025-SRlab.

Paper Structure

This paper contains 16 sections, 6 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Objective and subjective results of NTIRE 2025 Short-form UGC Image SR Challenge. The top six methods are included. The horizontal axis represents the objective score, which is computed as $\text{Score}=\text{PSNR}+10\times \text{SSIM}-10\times \text{LPIPS}+0.1\times \text{MUSIQ}+10\times \text{ManIQA}+10\times \text{CLIPIQA}$. The vertical axis represents the subjective score calculated by five experts. All results above are provided by the competition organizer.
  • Figure 2: (a) Overview of our proposed method. Our approach builds upon the diffusion framework, incorporating a mechanism to enforce structural consistency and preserve fidelity to the original LR image. Additionally, we leverage SAM2 for semantic-guided refinement, extracting high-level semantic embeddings to enhance adaptability to diverse degradation conditions. (b) Architecture of the PCA and SCA modules. (c) Architecture of the SAM2 image encoder. The encoder comprises a trunk and a neck. The trunk extracts multi-scale features from the low-resolution image through four stages with varying numbers of MSB (Multi-Scale Block) layers. The neck applies convolutions at all scales and performs top-down feature fusion on low-resolution features. The output includes refined multi-scale feature maps and corresponding positional encodings.
  • Figure 3: Comparison of images from the wild and synthetic datasets before and after 1$\times$ and 4$\times$ super-resolution processing with our model.
  • Figure 4: Comparison of images restored by different models on the synthetic and wild validation datasets.
  • Figure 5: Image restoration results of wild and synthetic datasets under different gs values.
  • ...and 1 more figure
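
The objective score quoted in the Figure 1 caption is a weighted sum of six image-quality metrics, with LPIPS subtracted because lower LPIPS is better. The sketch below only reproduces that arithmetic. The function name `objective_score` and the sample metric values are hypothetical and are not taken from the paper or the challenge results. It assumes each metric has already been computed elsewhere and is passed in as a plain float.

```python
def objective_score(psnr: float, ssim: float, lpips: float,
                    musiq: float, maniqa: float, clipiqa: float) -> float:
    """Weighted sum from the Figure 1 caption:
    Score = PSNR + 10*SSIM - 10*LPIPS + 0.1*MUSIQ + 10*ManIQA + 10*CLIPIQA
    """
    return (
        psnr
        + 10.0 * ssim
        - 10.0 * lpips      # LPIPS is a distance, so it lowers the score
        + 0.1 * musiq
        + 10.0 * maniqa
        + 10.0 * clipiqa
    )


if __name__ == "__main__":
    # Hypothetical metric values, chosen only to illustrate the arithmetic.
    score = objective_score(psnr=25.0, ssim=0.7, lpips=0.2,
                            musiq=60.0, maniqa=0.4, clipiqa=0.6)
    print(f"objective score: {score:.2f}")
```

With these made-up inputs the terms are 25 + 7 - 2 + 6 + 4 + 6, so the score comes to about 46. The weights suggest that the metrics bounded by 1 (SSIM, LPIPS, ManIQA, CLIPIQA) are scaled by 10 and MUSIQ is scaled by 0.1 so that each term contributes on a roughly comparable scale, though the caption does not state this rationale.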