PG-ControlNet: A Physics-Guided ControlNet for Generative Spatially Varying Image Deblurring

Hakki Motorcu, Mujdat Cetin

TL;DR

Addressing the ill-posed problem of spatially varying blur, the paper introduces PG-ControlNet, a physics-guided conditional diffusion framework that represents the blur field as a dense, region-adaptive set of local kernels. Local kernels are compressed via PCA into a 128-dimensional descriptor field aligned to the image grid, which conditions a ControlNet-based diffusion model built on a frozen Stable Diffusion backbone. Only the ControlNet branch and its hint encoder are trained, enabling posterior sampling that enforces data fidelity while preserving perceptual realism. Experiments on 512x512 COCO-2017 data show superior perceptual metrics (LPIPS, FID, FSIM) with competitive fidelity, outperforming both model-based and diffusion baselines under challenging nonuniform blur. The approach offers a practical route to combining physical measurements with generative priors, with potential applications in microscopy, aerial imaging, and depth-aware photography.

Abstract

Spatially varying image deblurring remains a fundamentally ill-posed problem, especially when degradations arise from complex mixtures of motion and other forms of blur under significant noise. State-of-the-art learning-based approaches generally fall into two paradigms: model-based deep unrolling methods that enforce physical constraints by modeling the degradations, but often produce over-smoothed, artifact-laden textures, and generative models that achieve superior perceptual quality yet hallucinate details due to weak physical constraints. In this paper, we propose a novel framework that uniquely reconciles these paradigms by taming a powerful generative prior with explicit, dense physical constraints. Rather than oversimplifying the degradation field, we model it as a dense continuum of high-dimensional compressed kernels, ensuring that minute variations in motion and other degradation patterns are captured. We leverage this rich descriptor field to condition a ControlNet architecture, strongly guiding the diffusion sampling process. Extensive experiments demonstrate that our method effectively bridges the gap between physical accuracy and perceptual realism, outperforming state-of-the-art model-based methods as well as generative baselines in challenging, severely blurred scenarios.
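The descriptor construction summarized above (vectorize each local kernel, compress it with PCA, then spread the codes over the image grid with soft region masks) can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names `fit_pca` and `descriptor_field`, the 15x15 kernel size, the random kernel bank, and the SVD-based PCA are all assumptions made for the example. Only the 128-dimensional descriptor size comes from the summary.

```python
import numpy as np

def fit_pca(kernel_bank, n_components):
    """Fit PCA on vectorized kernels (hypothetical helper); returns (mean, components)."""
    flat = kernel_bank.reshape(len(kernel_bank), -1)
    mean = flat.mean(axis=0)
    # Rows of vt are principal directions, sorted by decreasing variance.
    _, _, vt = np.linalg.svd(flat - mean, full_matrices=False)
    return mean, vt[:n_components]

def descriptor_field(kernels, soft_masks, mean, components):
    """Blend per-region PCA codes into a dense (C, H, W) descriptor field."""
    # One C-dimensional code per region kernel: shape (R, C).
    codes = (kernels.reshape(len(kernels), -1) - mean) @ components.T
    # Normalize the soft masks so the region weights sum to ~1 at every pixel.
    weights = soft_masks / (soft_masks.sum(axis=0, keepdims=True) + 1e-8)
    # Each pixel receives the mask-weighted combination of the region codes.
    return np.einsum("rc,rhw->chw", codes, weights)

rng = np.random.default_rng(0)

# Assumed stand-in for a kernel dataset: 256 random nonnegative 15x15 PSFs.
bank = rng.random((256, 15, 15))
bank /= bank.sum(axis=(1, 2), keepdims=True)  # each PSF sums to 1

mean, components = fit_pca(bank, n_components=128)

region_kernels = bank[:3]                 # one PSF per region
soft_masks = rng.random((3, 32, 32))      # stand-in for smoothed segmentation masks
D = descriptor_field(region_kernels, soft_masks, mean, components)
print(D.shape)  # (128, 32, 32)
```

Where a single region dominates, the field reduces to that region's PCA code; near region boundaries it interpolates between neighboring codes, which is the continuous transition the Figure 1 caption describes.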

Paper Structure

This paper contains 13 sections, 1 equation, 4 figures, 1 table.

Figures (4)

  • Figure 1: Spatially varying data generation and dense blur descriptor construction pipeline. (Top) Soft region masks derived from segmentation maps are used to assign region-specific point spread functions (PSFs) $K_i$ to different objects, synthesizing the blurred observation $\mathbf{y}$ according to the forward model in Eq. (1). (Bottom) To construct the dense descriptor field $\mathbf{D}$, each $K_i$ is vectorized and PCA-compressed. These embeddings are then spatially distributed by weighting them with the same smooth normalized maps to ensure continuous transitions at region boundaries. (Right) The final input to the ControlNet Hint Encoder is a concatenated tensor formed by stacking the blurry image (green block) and the dense descriptor field (blue block) channel-wise. The red frame highlights an example stack of PCA vectors on a segmentation map transition region, illustrating the interpolated conditioning signal resulting from the weighted combination of neighboring kernel embeddings.
  • Figure 2: Qualitative comparison on diverse scenes with spatially varying blur. Columns show: input, ground truth, Restormer, MIMO-UNet++, DeblurDiff, DMBSR, our non-generative Convnext-UNet, and the proposed PG-ControlNet. Zoomed regions are shown beneath each reconstruction.
  • Figure 3: Overview of the proposed PG-ControlNet framework. The frozen Stable Diffusion 1.5 backbone and VAE are depicted in blue. The text encoder and time inputs are omitted for simplicity. The ControlNet and the hint encoder are trained on concatenated inputs of the blurry image $\mathbf{y}$ and its dense blur descriptor field $\mathbf{D}$.
  • Figure 4: Ablation and limitation analysis. Row 1: Effect of removing the kernel field. Row 2: Robustness under noisy kernels. Rows 3-4: Limitations inherited from the SD-1.5 backbone on text and faces.
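
The data-generation and hint-stacking steps in the Figure 1 caption can be illustrated with a small NumPy sketch. The paper's forward-model equation is not reproduced on this page, so the example assumes a common mask-weighted form, $\mathbf{y} = \sum_i M_i \odot (K_i * \mathbf{x}) + \mathbf{n}$, with normalized soft masks $M_i$ and Gaussian noise $\mathbf{n}$. The helper names, the circular (FFT-based) boundary handling, the 9x9 kernels, the RGB input, and the random placeholder for $\mathbf{D}$ are likewise assumptions for illustration only.

```python
import numpy as np

def blur_circular(img, kernel):
    """Circular 2-D convolution of every channel of img (C, H, W) with an odd-sized kernel."""
    _, h, w = img.shape
    k = kernel.shape[0]
    padded = np.zeros((h, w))
    padded[:k, :k] = kernel
    # Shift the kernel center to the origin so the output is not translated.
    padded = np.roll(padded, (-(k // 2), -(k // 2)), axis=(0, 1))
    # fft2 acts on the last two axes, so the kernel spectrum broadcasts over channels.
    return np.real(np.fft.ifft2(np.fft.fft2(img) * np.fft.fft2(padded)))

def synthesize_blur(x, kernels, soft_masks, noise_std, rng):
    """Assumed forward model: mask-weighted sum of per-region blurs plus Gaussian noise."""
    weights = soft_masks / (soft_masks.sum(axis=0, keepdims=True) + 1e-8)
    y = np.zeros_like(x)
    for w, k in zip(weights, kernels):
        y += w[None] * blur_circular(x, k)
    return y + noise_std * rng.standard_normal(x.shape)

def hint_tensor(y, descriptor):
    """Stack the blurry image and the descriptor field along the channel axis."""
    return np.concatenate([y, descriptor], axis=0)

rng = np.random.default_rng(0)
x = rng.random((3, 64, 64))                       # stand-in sharp RGB image
kernels = rng.random((2, 9, 9))
kernels /= kernels.sum(axis=(1, 2), keepdims=True)  # each PSF sums to 1
soft_masks = rng.random((2, 64, 64))              # stand-in soft region masks

y = synthesize_blur(x, kernels, soft_masks, noise_std=0.01, rng=rng)
D = rng.standard_normal((128, 64, 64))            # placeholder for the real descriptor field
hint = hint_tensor(y, D)
print(hint.shape)  # (131, 64, 64)
```

Under these assumptions the hint encoder would receive 3 + 128 = 131 input channels per pixel, so the observation and its local blur description stay spatially aligned.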