SAM-DiffSR: Structure-Modulated Diffusion Model for Image Super-Resolution

Chengcheng Wang, Zhiwei Hao, Yehui Tang, Jianyuan Guo, Yujie Yang, Kai Han, Yunhe Wang

TL;DR

This paper tackles structure-aware texture restoration in diffusion-based single-image super-resolution. It introduces SAM-DiffSR, a framework that modulates the mean of the forward-diffusion noise using SAM segmentation masks encoded with structural position encoding (SPE), so that each region is restored according to its own structure. The reverse diffusion process is unchanged and SAM is needed only during training, so inference cost does not increase. The method reduces artifacts and improves PSNR over existing diffusion-based methods by up to 0.74 dB on DIV2K, with evaluations on related benchmarks as well. Its core contribution is integrating fine-grained structure information into diffusion training, improving texture fidelity while remaining practical for real-world SR tasks.

Abstract

Diffusion-based super-resolution (SR) models have recently garnered significant attention due to their potent restoration capabilities. However, conventional diffusion models sample noise from a single distribution, which constrains their ability to handle real-world scenes and complex textures across semantic regions. With the success of the Segment Anything Model (SAM), sufficiently fine-grained region masks can be generated to enhance the detail recovery of diffusion-based SR models. Directly integrating SAM into SR models, however, results in much higher computational cost. In this paper, we propose the SAM-DiffSR model, which uses the fine-grained structure information from SAM when sampling noise to improve image quality without additional computational cost during inference. During training, we encode structural position information into the segmentation mask from SAM. The encoded mask is then integrated into the forward diffusion process by using it to modulate the sampled noise. This adjustment allows us to independently adapt the noise mean within each corresponding segmentation area. The diffusion model is trained to estimate this modulated noise. Crucially, our proposed framework does NOT change the reverse diffusion process and does NOT require SAM at inference. Experimental results demonstrate the effectiveness of our proposed method, which suppresses artifacts better than existing diffusion-based methods and surpasses them by up to 0.74 dB in PSNR on the DIV2K dataset. The code and dataset are available at https://github.com/lose4578/SAM-DiffSR.
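
The training-time noise modulation described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`encode_mask`, `modulated_forward_sample`, `region_mean`), the per-region scalar offsets standing in for the SPE-encoded mask, and the simple additive form `eps + mask` inside a standard DDPM-style forward step are all hypothetical choices made for the example. It uses only the Python standard library and a tiny two-region "image".

```python
import math
import random


def encode_mask(segment_ids, region_offsets):
    """Toy stand-in for the encoded mask: map each segment id to a scalar offset.

    Assumption: the real structural position encoding is richer than this lookup.
    """
    return [[region_offsets[s] for s in row] for row in segment_ids]


def modulated_forward_sample(x0, encoded_mask, alpha_bar_t, rng):
    """Draw x_t from x_0 using noise whose mean differs per segmentation area.

    Assumption: modulated noise = eps + encoded_mask with eps ~ N(0, 1) per pixel,
    plugged into x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise.
    Returns (x_t, target); target is the modulated noise a denoiser would regress.
    """
    signal = math.sqrt(alpha_bar_t)
    sigma = math.sqrt(1.0 - alpha_bar_t)
    x_t, target = [], []
    for x_row, m_row in zip(x0, encoded_mask):
        noise_row = [rng.gauss(0.0, 1.0) + m for m in m_row]
        x_t.append([signal * x + sigma * n for x, n in zip(x_row, noise_row)])
        target.append(noise_row)
    return x_t, target


def region_mean(values, segment_ids, segment):
    """Average of `values` over the pixels that belong to `segment`."""
    picked = [v for v_row, s_row in zip(values, segment_ids)
              for v, s in zip(v_row, s_row) if s == segment]
    return sum(picked) / len(picked)


if __name__ == "__main__":
    rng = random.Random(0)
    size = 64
    # Two vertical regions: segment 0 on the left half, segment 1 on the right half.
    segment_ids = [[0 if col < size // 2 else 1 for col in range(size)]
                   for _ in range(size)]
    x0 = [[0.5] * size for _ in range(size)]
    mask = encode_mask(segment_ids, {0: -0.5, 1: 0.5})
    x_t, target = modulated_forward_sample(x0, mask, alpha_bar_t=0.5, rng=rng)
    for seg in (0, 1):
        # The empirical noise mean in each region should sit near its offset.
        print(seg, round(region_mean(target, segment_ids, seg), 2))
```

In this sketch the mask is needed only to build the training target; a model trained to predict `target` would be sampled with the usual reverse process, which is consistent with the abstract's statement that SAM is not required at inference.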

Paper Structure (28 sections, 16 equations, 9 figures, 7 tables)

Figures (9)

  • Figure 1: (A) Comparison of the noise distribution in the forward diffusion process between existing diffusion-based image SR methods and our SAM-DiffSR. Our approach enhances the restoration of different image areas by modulating the corresponding noise with guidance from segmentation masks generated by SAM. (B) Visualization of restored images generated by different methods. Our method achieves reconstruction performance similar to that of directly integrating SAM into the diffusion model.
  • Figure 2: Comparison of the MANIQA, FID, PSNR, and Artifact metrics (the Artifact metric is defined in the paper's section on artifacts) on the DIV2K dataset. Higher values of MANIQA and PSNR are better, while lower values of FID and Artifact are preferred. The red arrow indicates the direction of best performance across the combined horizontal and vertical metrics.
  • Figure 3: Comparison between (a) directly integrating SAM into the diffusion model and (b) our proposed SAM-DiffSR, with PSNR evaluated on the DIV2K dataset. In (a), mask information predicted by SAM is used during both the training and inference stages. In contrast, (b) employs only the modulated noise generated by the structural noise modulation model during training. The details of structural noise modulation can be found in Figure 4(a), and our method achieves reconstruction performance comparable to (a), as demonstrated in Figure 1(B).
  • Figure 4: (a) During training, SAM generates a segmentation mask for an HR image, and a structural position encoding (SPE) module encodes structure-level position information in the mask. The encoded mask is then added to the noise to modulate its mean in each segmentation area separately. At inference time, the framework uses only the trained diffusion model for image restoration, eliminating the inference cost of SAM. (b) The SPE module, which encodes structural position information in the mask generated by SAM.
  • Figure 5: Visualization of restored images generated by different methods. Our SAM-DiffSR surpasses other approaches in terms of both higher reconstruction quality and fewer artifacts. Additional visualization results can be found in our supplementary material.
  • ...and 4 more figures