Hero-SR: One-Step Diffusion for Super-Resolution with Human Perception Priors

Jiangang Wang, Qingnan Fan, Qi Zhang, Haigen Liu, Yuhang Yu, Jinwei Chen, Wenqi Ren

TL;DR

Real-world SR demands reconstructions that satisfy semantic consistency and perceptual naturalness under heavy degradation. Hero-SR addresses this with a one-step diffusion framework augmented by a Dynamic Time-Step Module (DTSM) and Open-World Multi-modality Supervision (OWMS), which leverages CLIP guidance across the text and image domains. DTSM adaptively selects the diffusion step $t^*$ from a candidate set via Gumbel-Softmax, while OWMS imposes a Text-Domain Perceptual Alignment Loss and an Image-Domain Semantic Alignment Loss to align outputs with human perception. Experiments on synthetic and real datasets show that Hero-SR achieves state-of-the-art performance among one-step methods and strong results against multi-step baselines, especially on perceptual quality metrics. One limitation is the VAE’s limited capacity to recover very small structures, which suggests avenues for future refinement.
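The Gumbel-Softmax selection of $t^*$ mentioned above can be illustrated with a minimal, forward-only sketch. It is not the authors' implementation: the candidate time-steps, the logits (which in Hero-SR would presumably come from a network conditioned on $I_{LR}$), the temperature `tau`, and the function name `gumbel_softmax_select` are all assumptions made for illustration. A trainable version would also need a differentiable relaxation (for example a straight-through estimator), which this plain-Python sketch omits.

```python
import math
import random


def gumbel_softmax_select(logits, candidates, tau=1.0, rng=None):
    """Pick one candidate time-step via Gumbel-Softmax (forward pass only).

    logits     -- one unnormalized score per candidate (hypothetical values)
    candidates -- the candidate time-steps (hypothetical values)
    tau        -- softmax temperature; smaller values give sharper probabilities
    """
    if len(logits) != len(candidates):
        raise ValueError("logits and candidates must have the same length")
    rng = rng or random.Random()

    # Perturb each logit with Gumbel(0, 1) noise: g = -log(-log(u)), u ~ U(0, 1).
    noisy = []
    for logit in logits:
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep log() finite
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / tau)

    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Hard selection: take the candidate with the highest probability.
    index = max(range(len(probs)), key=probs.__getitem__)
    return candidates[index], probs


# Hypothetical candidate set and scores, chosen only to exercise the function.
candidate_steps = [100, 250, 500, 750, 999]
example_logits = [0.1, 0.4, 1.2, 0.3, -0.5]
t_star, probabilities = gumbel_softmax_select(
    example_logits, candidate_steps, tau=0.5, rng=random.Random(0)
)
print("selected time-step:", t_star)
```

Lowering `tau` pushes the probabilities toward a one-hot vector, so the soft sample approaches the hard choice; raising it makes the selection more exploratory.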

Abstract

Owing to the robust priors of diffusion models, recent approaches have shown promise in addressing real-world super-resolution (Real-SR). However, achieving semantic consistency and perceptual naturalness to meet human perception demands remains difficult, especially under conditions of heavy degradation and varied input complexities. To tackle this, we propose Hero-SR, a one-step diffusion-based SR framework explicitly designed with human perception priors. Hero-SR consists of two novel modules: the Dynamic Time-Step Module (DTSM), which adaptively selects optimal diffusion steps for flexibly meeting human perceptual standards, and the Open-World Multi-modality Supervision (OWMS), which integrates guidance from both image and text domains through CLIP to improve semantic consistency and perceptual naturalness. Through these modules, Hero-SR generates high-resolution images that not only preserve intricate details but also reflect human perceptual preferences. Extensive experiments validate that Hero-SR achieves state-of-the-art performance in Real-SR. The code will be publicly available upon paper acceptance.
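The abstract does not state how the CLIP guidance is turned into losses, so the following is only a generic sketch of how image-domain and text-domain alignment terms are often built from CLIP-style embeddings. Everything here is an assumption: the embeddings are plain Python lists standing in for CLIP encoder outputs, the image term is taken as one minus cosine similarity, the text term is a two-prompt softmax cross-entropy, and the names `image_alignment_loss`, `text_alignment_loss`, and the `scale` value are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def image_alignment_loss(sr_embedding, hr_embedding):
    """Assumed image-domain term: 0 when the SR and HR embeddings point the same way."""
    return 1.0 - cosine_similarity(sr_embedding, hr_embedding)


def text_alignment_loss(sr_embedding, positive_text, negative_text, scale=100.0):
    """Assumed text-domain term: negative log-probability of the positive prompt.

    The two prompt embeddings stand for a desirable and an undesirable quality
    description; `scale` is a hypothetical logit scale.
    """
    s_pos = scale * cosine_similarity(sr_embedding, positive_text)
    s_neg = scale * cosine_similarity(sr_embedding, negative_text)
    peak = max(s_pos, s_neg)  # subtract the max for numerical stability
    log_norm = math.log(math.exp(s_pos - peak) + math.exp(s_neg - peak))
    return -((s_pos - peak) - log_norm)


# Toy 3-dimensional vectors standing in for real CLIP embeddings.
sr = [0.9, 0.1, 0.0]
hr = [1.0, 0.0, 0.0]
good_prompt = [1.0, 0.0, 0.0]
bad_prompt = [0.0, 1.0, 0.0]
print("image term:", image_alignment_loss(sr, hr))
print("text term:", text_alignment_loss(sr, good_prompt, bad_prompt))
```

In a real training setup both terms would be computed on encoder outputs inside an autodiff framework and combined with the other reconstruction losses; the weighting between them is not specified in this excerpt.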

Paper Structure

This paper contains 20 sections, 14 equations, 17 figures, 6 tables.

Figures (17)

  • Figure 1: Performance and Visual Comparison. (1) Performance Comparison: Compared to one-step and multi-step methods, Hero-SR achieves superior performance with just a single diffusion step. Tested on the DRealSR benchmark, all metrics are normalized using min-max scaling, with 'S' denoting the number of diffusion steps. (2) Visual Comparison: Hero-SR restores more realistic textures and aligns better with human perception, outperforming both one-step and multi-step methods. Zoom in for details.
  • Figure 2: Training framework of Hero-SR. Hero-SR incorporates a Dynamic Time-Step Module to adaptively determine the optimal time-step $t^*$ based on the input image $I_{LR}$, flexibly meeting human perceptual standards. Both $I_{LR}$ and $t^*$ are then input to the diffusion network to generate the restored image $I_{SR}$. Text-Domain Perceptual Alignment Loss and Image-Domain Semantic Alignment Loss ensure semantic consistency and perceptual naturalness, aligning outputs with human perception.
  • Figure 3: The time-step selection process of DTSM. Previous one-step methods use a fixed starting time-step from pure noise, while DTSM adaptively selects a dynamic starting time-step based on the input image to better align with the diffusion process.
  • Figure 4: Qualitative comparison with one-step and multi-step methods. 'S' indicates the number of diffusion steps. Zoom in for details.
  • Figure 5: Qualitative comparison with GAN-based methods. Zoom in for details.
  • ...and 12 more figures
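
The min-max scaling mentioned in the Figure 1 caption maps each metric onto $[0, 1]$ across the compared methods so that metrics with different ranges can share one plot. A minimal sketch follows; the input numbers are made up for illustration and are not results from the paper, and the handling of lower-is-better metrics (which may need to be inverted first) is not described in this excerpt.

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0.0 and the maximum to 1.0."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # All methods tie on this metric; return zeros to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]


# Hypothetical scores of three methods on a single metric.
print(min_max_scale([2.0, 4.0, 6.0]))
```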