MegaSR: Mining Customized Semantics and Expressive Guidance for Real-World Image Super-Resolution
Xinrui Li, Jinrong Zhang, Jianlong Wu, Chong Chen, Liqiang Nie, Zhouchen Lin
TL;DR
MegaSR tackles real-world image super-resolution (Real-ISR) by identifying three failure modes of text-to-image (T2I)-based approaches: fine-detail deficiency, block-wise semantic inconsistency, and edge ambiguity. It introduces a Customized Semantics Module (CSM), which uses Dual-Path Cross-Attention and Learnable Gated Weight Adaptation to tailor multi-level semantics to each U-Net block, and a Multimodal Signal Fusion Module (MSFM), which injects expressive guidance from depth, segmentation, and edge cues. Two prior-guided fine-tuning strategies adapt the signal extractors to degraded inputs, and a two-stage fusion integrates the multimodal signals coherently into the diffusion backbone. Extensive experiments on real and synthetic datasets show that MegaSR achieves state-of-the-art perceptual quality with competitive fidelity, supporting its semantic richness and structural consistency for practical Real-ISR tasks.
Abstract
Text-to-image (T2I) models have ushered in a new era of real-world image super-resolution (Real-ISR) due to their rich internal implicit knowledge for multimodal learning. Although introducing high-level semantic priors and dense pixel guidance has led to advances in reconstruction, we identify several critical phenomena by analyzing the behavior of existing T2I-based Real-ISR methods: (1) Fine-detail deficiency, which ultimately leads to incorrect reconstruction in local regions. (2) Block-wise semantic inconsistency, which results in distracted semantic interpretations across U-Net blocks. (3) Edge ambiguity, which causes noticeable structural degradation. Building upon these observations, we first introduce MegaSR, which enhances T2I-based Real-ISR models with fine-grained customized semantics and expressive guidance to unlock semantically rich and structurally consistent reconstruction. Then, we propose the Customized Semantics Module (CSM) to supplement fine-grained semantics from the image modality and to regulate the semantic fusion between multi-level knowledge, realizing customization for different U-Net blocks. Beyond semantic adaptation, we identify expressive multimodal signals through pair-wise comparisons and introduce the Multimodal Signal Fusion Module (MSFM) to aggregate them for structurally consistent reconstruction. Extensive experiments on real-world and synthetic datasets demonstrate the superiority of the method. Notably, it not only achieves state-of-the-art performance on quality-driven metrics but also remains competitive on fidelity-focused metrics, striking a balance between perceptual realism and faithful content reconstruction.
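
The idea of regulating how multi-level semantics are fused for each U-Net block can be illustrated with a small sketch. This is not the CSM implementation: the function names (`cross_attention`, `gated_dual_path`), the tensor shapes, the sigmoid gate, the residual blend, and the per-block gate values are all assumptions chosen for illustration, and the attention omits learned projections and multiple heads. It only shows one plausible way a learnable scalar per block could weight a coarse-semantics path against a fine-grained image-semantics path.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context):
    # Single-head scaled dot-product cross-attention without learned
    # projections (a simplification made for this sketch).
    # queries: (n_q, d), context: (n_c, d) -> output: (n_q, d)
    d = queries.shape[-1]
    scores = queries @ context.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ context

def gated_dual_path(features, coarse_sem, fine_sem, gate_logit):
    # Hypothetical fusion rule: a sigmoid gate in (0, 1) weights the
    # coarse path, and (1 - gate) weights the fine-grained path.
    gate = 1.0 / (1.0 + np.exp(-gate_logit))
    coarse_out = cross_attention(features, coarse_sem)
    fine_out = cross_attention(features, fine_sem)
    return features + gate * coarse_out + (1.0 - gate) * fine_out

rng = np.random.default_rng(0)
features = rng.normal(size=(16, 8))    # U-Net block tokens (assumed shape)
coarse_sem = rng.normal(size=(4, 8))   # high-level semantic tokens (assumed)
fine_sem = rng.normal(size=(32, 8))    # fine-grained image tokens (assumed)

# One gate logit per block; in training these would be learned parameters.
# The block names and values here are made up.
block_gate_logits = {"down_0": -1.0, "mid": 0.0, "up_0": 1.5}
outputs = {
    name: gated_dual_path(features, coarse_sem, fine_sem, logit)
    for name, logit in block_gate_logits.items()
}
```

Because each block owns its gate, different blocks can lean on different semantic levels, which is the kind of per-block customization the abstract describes.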
