High Perceptual Quality Image Denoising with a Posterior Sampling CGAN

Guy Ohayon, Theo Adrai, Gregory Vaksman, Michael Elad, Peyman Milanfar

TL;DR

The paper tackles the perceptual quality vs. distortion dilemma in image denoising by proposing PSCGAN, a posterior-sampling CGAN that samples from $\mathbb{P}_{\mathbf{x}|\mathbf{y}}$ and introduces a mean-consistency penalty to prevent mode collapse. The method uses a StyleGAN2/UNet-inspired encoder–decoder with multi-scale noise injections and a gradient-penalized WGAN objective, enabling diverse, sharp denoised outputs while preserving fidelity on average. A dual-capability denoiser is presented: PSCGAN for posterior sampling and PSCGAN-A that averages multiple samples to approximate MMSE, with experiments on FFHQ and LSUN datasets showing strong perceptual quality (low FID) and competitive PSNR relative to MMSE, across noise levels. The results highlight the method’s ability to traverse the perception-distortion tradeoff, maintain Gaussian-like remainder noise, and provide practically valuable, non-blurry denoising suitable for high-noise scenarios.
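
The link between sample averaging and the MMSE estimate can be checked on a toy problem. The sketch below is not the paper's model: it assumes a scalar Gaussian signal with additive Gaussian noise, where the posterior is known in closed form, and the helper names (`posterior_params`, `sample_posterior`) are hypothetical. It shows that a single posterior sample has roughly twice the MSE of the MMSE estimate, and that averaging $N$ samples brings the ratio down to about $1 + 1/N$.

```python
import numpy as np

def posterior_params(y, noise_std):
    """Posterior mean and std of x given y, for x ~ N(0, 1) and y = x + noise."""
    var = noise_std ** 2
    mean = y / (1.0 + var)
    std = np.sqrt(var / (1.0 + var))
    return mean, std

def sample_posterior(y, noise_std, n_samples, rng):
    """Draw n_samples independent posterior samples for every entry of y."""
    mean, std = posterior_params(y, noise_std)
    return mean + std * rng.standard_normal((n_samples,) + y.shape)

rng = np.random.default_rng(0)
noise_std = 1.0  # toy noise level, not one of the paper's settings
x = rng.standard_normal(200_000)
y = x + noise_std * rng.standard_normal(x.shape)

# The posterior mean is the MMSE estimate in this toy model.
mmse_estimate, _ = posterior_params(y, noise_std)
mmse = np.mean((x - mmse_estimate) ** 2)

ratios = {}
for n in (1, 4, 16):
    averaged = sample_posterior(y, noise_std, n, rng).mean(axis=0)
    ratios[n] = np.mean((x - averaged) ** 2) / mmse
    print(n, ratios[n])  # ratio is close to 1 + 1/n
```

In this toy setting, `n = 1` plays the role of a single sampled output and larger `n` plays the role of an averaged output that moves toward the MMSE estimate.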

Abstract

The vast work in Deep Learning (DL) has led to a leap in image denoising research. Most DL solutions for this task have focused their efforts on the denoiser's architecture while maximizing distortion performance. However, distortion-driven solutions lead to blurry results with sub-optimal perceptual quality, especially in immoderate noise levels. In this paper we propose a different perspective, aiming to produce sharp and visually pleasing denoised images that are still faithful to their clean sources. Formally, our goal is to achieve high perceptual quality with acceptable distortion. This is attained by a stochastic denoiser that samples from the posterior distribution, trained as a generator in the framework of conditional generative adversarial networks (CGAN). Contrary to distortion-based regularization terms that conflict with perceptual quality, we introduce to the CGAN objective a theoretically founded penalty term that does not force a distortion requirement on individual samples, but rather on their mean. We showcase our proposed method with a novel denoiser architecture that achieves the reformed denoising goal and produces vivid and diverse outcomes in immoderate noise levels.
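
The difference between penalizing individual samples and penalizing their mean can be illustrated numerically. The abstract does not give the penalty's exact form or weight, so the sketch below assumes a squared error between the clean image and the average of several generated samples; the function names and the random "samples" standing in for generator outputs are hypothetical.

```python
import numpy as np

def mean_penalty(x, samples):
    """Squared error between x and the average of the samples.

    samples has shape (n_samples, *x.shape).
    """
    return float(np.mean((x - samples.mean(axis=0)) ** 2))

def per_sample_penalty(x, samples):
    """Squared error of each individual sample, averaged over samples."""
    return float(np.mean((x[None] - samples) ** 2))

rng = np.random.default_rng(0)
x = rng.uniform(size=(8, 8))  # stand-in for a clean image

# Stand-in for stochastic denoiser outputs: diverse samples scattered around x.
samples = x[None] + 0.1 * rng.standard_normal((64, 8, 8))

p_mean = mean_penalty(x, samples)
p_each = per_sample_penalty(x, samples)
spread = float(np.mean(samples.var(axis=0)))

# The per-sample error splits into the mean penalty plus the sample variance,
# so a per-sample penalty also punishes diversity while the mean penalty does not.
print(p_mean, p_each, p_mean + spread)
```

Under this assumed form, the per-sample error equals the mean penalty plus the per-pixel variance of the samples, which is why a constraint on the mean leaves room for diverse outputs.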

Paper Structure

This paper contains 18 sections, 9 equations, 8 figures, 1 table.

Figures (8)

  • Figure 1: Denoising results on the FFHQ test set produced by several methods. PSCGAN is a sampled denoised image produced by our proposed method, attained by injecting noise with standard deviation of $\sigma_{\mathbf{z}}=1$ (at both training and inference time). Ours-LAG ($\sigma_{\mathbf{z}}=0$) and Ours-LAG ($\sigma_{\mathbf{z}}=1$) are the same models, evaluated with $\sigma_{\mathbf{z}}=0$ and $\sigma_{\mathbf{z}}=1$ at inference time, respectively. In this case, PSCGAN-A averages 64 instances of PSCGAN. Each model was trained on the FFHQ training set to denoise a specific noise level ($\sigma=25, 50$ or $75$).
  • Figure 2: Stochastic variation of denoised images attained by 3 different generators, each trained with PSCGAN to denoise images contaminated with noise levels of $\sigma=25,50,75$. Two clean images are presented to the left, and their corresponding noisy versions to their right. Alongside each noisy input we show 4 examples of possible denoising outcomes, as well as the $4^{\text{th}}$ root of the per-pixel standard deviation image calculated on 32 samples. For convenience, a gray-scale color map is added to the right (white and black correspond to low and high standard deviations, respectively). All denoised image samples were obtained by injecting noise with $\sigma_{\mathbf{z}}=1$ at inference time.
  • Figure 3: FID versus PSNR results for PSCGAN, PSCGAN-A that averages $N$ PSCGAN instances, Ours-LAG, Ours-LAG-A that averages $N$ Ours-LAG instances, Ours-MSE and DnCNN. The noise contamination level ($\sigma=25,50,75$) is given in parentheses next to the name of each method. PSCGAN and PSCGAN-A are evaluated on different choices of $\sigma_{\mathbf{z}}$ and $N$ during inference, while $\sigma_{\mathbf{z}}$ is fixed to 1 when varying $N$. Ours-LAG and Ours-LAG-A are evaluated in the same fashion. For PSCGAN and Ours-LAG the values of $\sigma_{\mathbf{z}}$ are given next to each marked point. Similarly, the values of $N$ are given for PSCGAN-A. The performance results of the MSE-based methods are also plotted.
  • Figure 4: The approximated p.d.f of the patch-RMSE and of the local-remainder-noise-RMS obtained by PSCGAN and by Ours-MSE, and the approximated p.d.f of the local-noise-RMS obtained by Gaussian noise.
  • Figure 5: Our proposed generator architecture. An input noisy image is passed through an encoder of $i$ doubly-blocked convolutional layers and downsampled after each (except for the first block). The downsampling operation is performed by a stride of $2$ in the preceding convolution layer. The result of each doubly-blocked layer is then passed through a Drip, which is a feed-forward CNN (in the figure, $d_{k}$ and $c_{d_{k}}$ denote the number of layers and the number of output channels of each layer in drip $k$, respectively). Each of these drips extracts features that are limited to a certain receptive field, which are then passed to the neighboring decoder block through concatenation. This further assists in reconstructing the RGB result of the corresponding scale, especially at higher scales. The decoder builds the reconstructed image scale by scale, using features aggregated from previous layers of the decoder and from the drip injections. Noise injections are performed in the decoder's pipeline, where a noisy "image", denoted as $z_{k,1}$ and $z_{k,2}$ for each $1\leq k\leq i$, is concatenated as another feature map of the next layer's input. All convolutional layers, except for $tRGB$, are coupled with Leaky ReLU activation functions with a slope of $\alpha=0.2$ for negative values. $tRGB$ is a simple convolution operation with output channels being equal to the number of channels of the input image (3 for RGB images). All up-sampling operations are performed with nearest-neighbor interpolation.
  • ...and 3 more figures
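
A few of the operations named in the captions of Figures 2 and 5 can be sketched in NumPy: the fourth root of the per-pixel standard deviation across samples, the Leaky ReLU with slope 0.2, nearest-neighbor up-sampling, and noise injection by concatenating a noise map as an extra feature channel. This is a rough illustration, not the authors' implementation: the function names, the tensor layout `(channels, height, width)`, the 2x up-sampling factor, and the way $\sigma_{\mathbf{z}}$ scales the noise map are all assumptions.

```python
import numpy as np

def std_map(samples):
    """Fourth root of the per-pixel standard deviation across samples.

    samples has shape (n_samples, H, W).
    """
    return samples.std(axis=0) ** 0.25

def leaky_relu(features, slope=0.2):
    """Leaky ReLU with the negative-side slope quoted in the Figure 5 caption."""
    return np.where(features >= 0, features, slope * features)

def upsample_nearest(features):
    """Double the height and width of a (C, H, W) array by pixel repetition."""
    return features.repeat(2, axis=1).repeat(2, axis=2)

def inject_noise(features, sigma_z, rng):
    """Concatenate one Gaussian noise map, scaled by sigma_z, as an extra channel."""
    z = sigma_z * rng.standard_normal((1,) + features.shape[1:])
    return np.concatenate([features, z], axis=0)

rng = np.random.default_rng(0)
features = rng.standard_normal((4, 8, 8))  # stand-in for decoder features

up = upsample_nearest(leaky_relu(features))
with_noise = inject_noise(up, sigma_z=1.0, rng=rng)
print(with_noise.shape)  # (5, 16, 16)

# With sigma_z = 0 the injected channel is all zeros.
no_noise = inject_noise(up, sigma_z=0.0, rng=rng)

# Standard-deviation map over 32 stand-in samples, as in the Figure 2 caption.
samples = rng.uniform(size=(32, 16, 16))
variation = std_map(samples)
```

Setting `sigma_z=0.0` in this sketch yields an all-zero injected channel, which mirrors how the captions describe evaluating a model with $\sigma_{\mathbf{z}}=0$ at inference time.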