Revisiting the Perception-Distortion Trade-off with Spatial-Semantic Guided Super-Resolution

Dan Wang, Haiyan Sun, Shan Du, Z. Jane Wang, Zhaochong An, Serge Belongie, Xinrui Cui

Abstract

Image super-resolution (SR) aims to reconstruct high-resolution images with both high perceptual quality and low distortion, but is fundamentally limited by the perception-distortion trade-off. GAN-based SR methods reduce distortion but still struggle to produce realistic fine-grained textures, whereas diffusion-based approaches synthesize rich details but often deviate from the input, hallucinating structures and degrading fidelity. This tension raises a key challenge: how to exploit the powerful generative priors of diffusion models without sacrificing fidelity. To address this, we propose SpaSemSR, a spatial-semantic guided diffusion framework with two complementary forms of guidance. First, spatial-grounded textual guidance integrates object-level spatial cues with semantic prompts, aligning textual and visual structures to reduce distortion. Second, semantic-enhanced visual guidance, built on a multi-encoder design and semantic degradation constraints, unifies multimodal semantic priors and improves perceptual realism under severe degradations. The two forms of guidance are adaptively fused into the diffusion process via spatial-semantic attention, suppressing distortion and hallucination while retaining the strengths of diffusion models. Extensive experiments on multiple benchmarks show that SpaSemSR achieves a superior perception-distortion balance, producing restorations that are both realistic and faithful.
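As a rough illustration of what adaptively fusing textual and visual guidance through attention could look like, the sketch below lets latent features attend to text tokens and image tokens separately and blends the two results with a per-position gate. This is a generic gated dual cross-attention written under our own assumptions, not the SpaSemSR implementation: the names `cross_attention`, `fuse_guidance`, and `gate_w` are hypothetical, the attention uses no learned projections, and all shapes are arbitrary toy values.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_attention(queries, context):
    """Scaled dot-product attention with the context used as both keys and values.

    Assumption: no learned query/key/value projections, to keep the sketch short.
    """
    dim = queries.shape[-1]
    weights = softmax(queries @ context.T / np.sqrt(dim), axis=-1)
    return weights @ context


def fuse_guidance(latent, text_tokens, image_tokens, gate_w):
    """Blend text-guided and image-guided attention outputs with a sigmoid gate.

    gate_w is a hypothetical (dim, 1) weight that yields one gate value per
    latent position; the gated mixture is added to the latent as a residual.
    """
    text_out = cross_attention(latent, text_tokens)
    image_out = cross_attention(latent, image_tokens)
    gate = 1.0 / (1.0 + np.exp(-(latent @ gate_w)))  # shape (positions, 1)
    return latent + gate * text_out + (1.0 - gate) * image_out


# Toy inputs: 16 latent positions, 5 text tokens, 16 image tokens, width 8.
rng = np.random.default_rng(0)
latent = rng.normal(size=(16, 8))
text_tokens = rng.normal(size=(5, 8))
image_tokens = rng.normal(size=(16, 8))
gate_w = rng.normal(size=(8, 1))

fused = fuse_guidance(latent, text_tokens, image_tokens, gate_w)
```

In this sketch, a gate value near 1 favors the textual branch and a value near 0 favors the visual branch, so the balance can differ at every latent position. How the paper's spatial-semantic attention actually weights the two branches is not specified in the abstract.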

Paper Structure (16 sections, 12 equations, 5 figures, 4 tables)


Figures (5)

  • Figure 1: Perception-distortion trade-off for GAN-based methods, diffusion-based methods, and ours. GAN-based methods reduce distortion but produce blurry textures, while diffusion-based methods generate perceptually sharp yet hallucinated details. By integrating spatial-grounded textual guidance, SpaSemSR improves reconstruction fidelity (PSNR, SSIM in (a)), while semantic-enhanced visual guidance enhances perceptual quality (CLIP-IQA, MUSIQ, MANIQA in (b)), resulting in a better perception-distortion trade-off than GAN-based (c) and diffusion-based (d) models.
  • Figure 2: Framework overview. (a) Spatial-aware text encoders generate position-grounded textual prompts (Sec. \ref{Sec:Spatial-awareTextEncoders}); (b) Semantic-enhanced image encoders extract semantic-enhanced visual features with degradation constraints (Sec. \ref{sec:Semantic-enhancedImageEncoders}); (c) Spatial-semantic ControlNet integrates these multimodal conditions (Sec. \ref{sec:Spatial-SemanticControlNet}); (d) Spatial-semantic guided diffusion fuses semantic and spatial guidance with generative priors (Sec. \ref{sec:Spatial-SemanticDiffusion}).
  • Figure 3: Qualitative comparisons with different methods. Zoom in for a better view.
  • Figure 4: Visualization of spatial-grounded textual and semantic-enhanced visual guidance.
  • Figure 5: Ablation visualization with different variants on real-world datasets. Zoom in for a better view.
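
Figure 1 reports PSNR as one of its fidelity (distortion) measures. For readers unfamiliar with it, the following is a minimal sketch of the standard PSNR definition, 10 · log10(MAX² / MSE), applied to flat lists of pixel intensities. The function name `psnr` and the default `max_value` of 255 (8-bit images) are our assumptions for illustration; the paper's evaluation protocol (color space, cropping, and so on) is not described in this section.

```python
import math


def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences.

    Assumes intensities lie in [0, max_value]; higher values mean lower distortion.
    """
    if len(reference) != len(restored) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((r - s) ** 2 for r, s in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images: no distortion
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy example: a four-pixel reference and a slightly perturbed restoration.
reference_pixels = [10.0, 120.0, 200.0, 255.0]
restored_pixels = [12.0, 118.0, 203.0, 250.0]
score_db = psnr(reference_pixels, restored_pixels)
```

PSNR rewards pixel-wise agreement only, which is why it is paired in Figure 1 with perceptual metrics such as CLIP-IQA, MUSIQ, and MANIQA when discussing the perception-distortion trade-off.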