MegaSR: Mining Customized Semantics and Expressive Guidance for Real-World Image Super-Resolution

Xinrui Li, Jinrong Zhang, Jianlong Wu, Chong Chen, Liqiang Nie, Zhouchen Lin

TL;DR

MegaSR tackles real-world image super-resolution by identifying three failure modes of T2I-based approaches: fine-detail deficiency, block-wise semantic inconsistency, and edge ambiguity. It introduces a Customized Semantics Module (CSM), which uses Dual-Path Cross-Attention and Learnable Gated Weight Adaptation to tailor multi-level semantics to each U-Net block, and a Multimodal Signal Fusion Module (MSFM), which injects expressive guidance from depth, segmentation, and edge cues. Two prior-guided fine-tuning strategies adapt the signal extractors to degraded inputs, and a two-stage fusion integrates the multimodal signals into the diffusion backbone. Experiments on real-world and synthetic datasets show that MegaSR achieves state-of-the-art perceptual quality with competitive fidelity, supporting its claims of semantic richness and structural consistency for practical Real-ISR tasks.

Abstract

Text-to-image (T2I) models have ushered in a new era of real-world image super-resolution (Real-ISR) thanks to their rich internal implicit knowledge for multimodal learning. Although introducing high-level semantic priors and dense pixel guidance has led to advances in reconstruction, we identify several critical phenomena by analyzing the behavior of existing T2I-based Real-ISR methods: (1) fine-detail deficiency, which ultimately leads to incorrect reconstruction in local regions; (2) block-wise semantic inconsistency, which results in distracted semantic interpretations across U-Net blocks; and (3) edge ambiguity, which causes noticeable structural degradation. Building upon these observations, we first introduce MegaSR, which enhances T2I-based Real-ISR models with fine-grained customized semantics and expressive guidance to unlock semantically rich and structurally consistent reconstruction. We then propose the Customized Semantics Module (CSM), which supplements fine-grained semantics from the image modality and regulates the fusion of multi-level semantic knowledge so that it can be customized for different U-Net blocks. Beyond semantic adaptation, we identify expressive multimodal signals through pair-wise comparisons and introduce the Multimodal Signal Fusion Module (MSFM) to aggregate them for structurally consistent reconstruction. Extensive experiments on real-world and synthetic datasets demonstrate the superiority of the method. Notably, it not only achieves state-of-the-art performance on quality-driven metrics but also remains competitive on fidelity-focused metrics, striking a balance between perceptual realism and faithful content reconstruction.
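
To make the idea behind the CSM more concrete, the following is a minimal NumPy sketch of a gated dual-path cross-attention step: block features attend separately to textual tokens and to visual tokens, and a learnable per-block gate balances the two paths. This is an illustration under stated assumptions, not the authors' implementation. The function names (`cross_attention`, `gated_dual_path`, `init_path`), the tensor sizes, the scalar sigmoid gate, and the residual connection are all hypothetical choices; the paper's DPCA and gating module may be structured differently.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def cross_attention(queries, context, w_q, w_k, w_v):
    """Single-head cross-attention: queries attend to context tokens."""
    q = queries @ w_q                      # (n, d_head)
    k = context @ w_k                      # (m, d_head)
    v = context @ w_v                      # (m, d_feat)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v    # (n, d_feat)


def init_path(rng, d_feat, d_ctx, d_head):
    """Hypothetical projection weights for one attention path."""
    return (
        rng.normal(0.0, 0.02, (d_feat, d_head)),  # w_q
        rng.normal(0.0, 0.02, (d_ctx, d_head)),   # w_k
        rng.normal(0.0, 0.02, (d_ctx, d_feat)),   # w_v
    )


def gated_dual_path(features, text_tokens, image_tokens, params, gate_logit):
    """Fuse a textual path and a visual path with one gate per block.

    Assumption: the gate is a scalar in (0, 1) obtained from a learnable
    logit; a larger value favors the textual path, a smaller one the visual path.
    """
    text_out = cross_attention(features, text_tokens, *params["text"])
    image_out = cross_attention(features, image_tokens, *params["image"])
    g = sigmoid(gate_logit)
    return features + g * text_out + (1.0 - g) * image_out


rng = np.random.default_rng(0)
features = rng.normal(size=(16, 32))      # 16 spatial positions, 32 channels
text_tokens = rng.normal(size=(5, 24))    # e.g. embedded tag tokens
image_tokens = rng.normal(size=(9, 48))   # e.g. visual encoder tokens
params = {
    "text": init_path(rng, 32, 24, 8),
    "image": init_path(rng, 32, 48, 8),
}

out = gated_dual_path(features, text_tokens, image_tokens, params, gate_logit=0.0)
print(out.shape)  # (16, 32)
```

In this sketch each U-Net block would own its own `gate_logit`, so training could push some blocks toward the textual path and others toward the visual path, which is one simple way to realize per-block customization of semantics.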

Paper Structure

This paper contains 29 sections, 11 equations, 10 figures, 13 tables.

Figures (10)

  • Figure 1: Different phenomena observed in existing methods. (a) Solely using textual semantics for reconstruction results in erroneous fine-grained semantics or missing fine details. (b) The blocks at the two ends are sensitive to attribute-level concepts, while the middle blocks focus on instance-level concepts. (c) Semantic segmentation masks introduce edge ambiguity within the same semantic region and lead to artifacts in the results.
  • Figure 2: Detailed statements of the phenomena. (a) Incorporating fine-grained visual semantics contributes to both improved visual clarity and enhanced semantic fidelity. (b) Applying different prompts to U-Net blocks at varying widths demonstrates that wide and narrow blocks in T2I models play distinct roles. (c) The relative intensity between signals is determined by pairing them as inputs to Uni-ControlNet (Zhao et al., 2023) and evaluating the structural differences in the generated results.
  • Figure 3: Framework of the proposed method. Built on the T2I U-Net, it first takes LR images as input. It then uses RAM (Zhang et al., 2024) and the PGFT-CLIPV model to extract coarse-grained textual and fine-grained visual semantics, which are passed to DPCA, and dynamically adjusts their weights at different U-Net blocks with LGWAM. Next, it employs prior-guided fine-tuned extractors to obtain multimodal signals, which are progressively injected into the representations of the T2I model via MSFM. Finally, it produces HR images that are both semantically rich and structurally consistent.
  • Figure 4: Prior-guided fine-tuning strategies of the signal extractors. (a) For extractors with weaker degradation priors, we apply full-parameter fine-tuning to ensure flexibility for adaptation. (b) For extractors with stronger priors, we accelerate the process using parameter-efficient fine-tuning.
  • Figure 5: Qualitative comparisons with different Real-ISR methods. The proposed method achieves superior fidelity and realism in terms of semantic preservation and structural consistency.
  • ...and 5 more figures