Adaptive Multi-modal Fusion of Spatially Variant Kernel Refinement with Diffusion Model for Blind Image Super-Resolution

Junxiong Lin, Yan Wang, Zeng Tao, Boyang Wang, Qing Zhao, Haorang Wang, Xuan Tong, Xinji Mai, Yuxuan Lin, Wei Song, Jiawen Yu, Shaoqi Yan, Wenqiang Zhang

TL;DR

The paper introduces SSR, a blind image super-resolution framework that leverages diffusion priors while explicitly modeling spatially variant degradation through a Depth-Informed Kernel (DI-Kernel) and a Spatially Variant Kernel Refinement (SVKR). A three-modal Adaptive Multi-Modal Fusion (AMF) module fuses low-resolution images, monocular depth, and blur kernels to constrain the diffusion process, improving realism and fidelity. The approach combines iterative depth/kernel refinement with a lightweight non-blind SR step and diffusion conditioning, achieving state-of-the-art results on multiple benchmarks and demonstrating robustness to diverse degradation modes. The work highlights the value of multimodal conditioning in diffusion-based SR and suggests extensions to related low-level vision tasks.

Abstract

Pre-trained diffusion models for image generation encode a large amount of prior knowledge about intricate textures. Leveraging this prior knowledge for image super-resolution is a compelling direction. However, existing diffusion-based methods overlook the constraints that degradation information imposes on the diffusion process. They also ignore the spatial variability of the estimated blur kernel, which arises from factors such as motion jitter and out-of-focus regions in open-environment scenarios. As a result, the super-resolved images deviate noticeably from the real scene. To address these issues, we introduce a framework called Adaptive Multi-modal Fusion of Spatially Variant Kernel Refinement with Diffusion Model for Blind Image Super-Resolution (SSR). Within the SSR framework, we propose a Spatially Variant Kernel Refinement (SVKR) module. SVKR estimates a Depth-Informed Kernel, which takes depth information into account and is spatially variant. Additionally, SVKR enhances the accuracy of the depth information acquired from LR images, allowing the depth map and blur kernel estimates to improve each other. Finally, we introduce the Adaptive Multi-Modal Fusion (AMF) module to align the information from three modalities: low-resolution images, depth maps, and blur kernels. This alignment constrains the diffusion model to generate more authentic SR results.
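
To make the notion of a spatially variant, depth-dependent blur concrete, the following is a minimal sketch of such a degradation on a toy image. It is not the paper's DI-Kernel or SVKR implementation: the Gaussian kernel family, the `depth_to_sigma` mapping, and the parameters `focus_depth`, `base_sigma`, and `gain` are all hypothetical choices made for illustration. The only idea taken from the abstract is that each pixel is blurred by its own kernel and that the kernel depends on depth.

```python
import math

def gaussian_kernel(sigma, radius=2):
    """Return a normalized (2*radius+1) x (2*radius+1) Gaussian kernel."""
    size = 2 * radius + 1
    k = [[math.exp(-((x - radius) ** 2 + (y - radius) ** 2) / (2.0 * sigma ** 2))
          for x in range(size)] for y in range(size)]
    total = sum(sum(row) for row in k)
    return [[v / total for v in row] for row in k]

def depth_to_sigma(depth, focus_depth=0.5, base_sigma=0.3, gain=3.0):
    """Hypothetical mapping: blur grows with distance from the focal plane."""
    return base_sigma + gain * abs(depth - focus_depth)

def spatially_variant_blur(image, depth, radius=2):
    """Blur every pixel with its own kernel, selected by the depth at that pixel."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            kernel = gaussian_kernel(depth_to_sigma(depth[i][j]), radius)
            acc = 0.0
            for di in range(-radius, radius + 1):
                for dj in range(-radius, radius + 1):
                    # Replicate-pad by clamping coordinates at the border.
                    y = min(max(i + di, 0), h - 1)
                    x = min(max(j + dj, 0), w - 1)
                    acc += kernel[di + radius][dj + radius] * image[y][x]
            out[i][j] = acc
    return out

# Toy scene: a checkerboard whose depth increases from left (in focus) to right.
size = 8
image = [[float((i + j) % 2) for j in range(size)] for i in range(size)]
depth = [[0.5 + 0.5 * j / (size - 1) for j in range(size)] for _ in range(size)]
blurred = spatially_variant_blur(image, depth)
```

In this toy scene the left columns, which lie at the assumed focal depth, keep almost all of the checkerboard contrast, while the right columns are smoothed toward a uniform gray. A single spatially invariant kernel cannot reproduce both behaviors at once, which is the mismatch the abstract attributes to methods that estimate one kernel per image.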

Paper Structure (14 sections, 5 equations, 9 figures, 3 tables, 1 algorithm)

This paper contains 14 sections, 5 equations, 9 figures, 3 tables, 1 algorithm.

Figures (9)

  • Figure 1: Visual comparison (×4) on DRealSR. SwinIR, MANet, HAT, RealESRGAN, DASR and ResShift suffer from noise and blurring artifacts, while SSR can generate high-fidelity images.
  • Figure 2: The illustration of various blind super-resolution methods. (a) The majority of super-resolution methods assume that the image degradation process is spatially invariant, estimating only a single blur kernel for an individual image. (b) Diffusion based super-resolution methods using texture prior information. (c) Our SSR approach, which imposes constraints on the diffusion solution space through spatially variant blur kernels and depth information.
  • Figure 3: The framework of the proposed Adaptive Multi-modal Fusion of Spatially Variant Kernel Refinement with Diffusion Model for Blind Image Super-Resolution (SSR). (a) Illustration of the main process of SSR. (b) Depiction of the Depth-Informed Kernel Estimate Network (DKENet) for spatially variant kernel estimation. (c) Depiction of the Adaptive Multi-Modal Fusion (AMF) module for information fusion.
  • Figure 4: The visualization of depth maps and DI-Kernels during the SVKR iteration process, where depth information and degradation information are mutually enhanced during the iteration process.
  • Figure 5: Visual comparisons of several representative methods on examples of the DIV2K dataset. Zoom in for best view.
  • ...and 4 more figures