
DiffLoss: unleashing diffusion model as constraint for training image restoration network

Jiangtong Tan, Feng Zhao

TL;DR

Image restoration must balance perceptual naturalness with semantic fidelity under varying degradations. The authors propose DiffLoss, a training-time, diffusion-based prior that does not increase inference cost, leveraging a fixed unconditional diffusion model to constrain restorations in two ways: (i) naturalness through projection into the diffusion sampling space via a forward diffusion step, and (ii) semantic preservation through h-space bottleneck features. DiffLoss defines two losses, $L_{nat}$ and $L_{sem}$, combined as $L_{DiffLoss}=L_{nat}+\lambda L_{sem}$; together with a standard data-fidelity term, the full objective is $L_{total}=\|x - z\|_2 + \gamma L_{DiffLoss}$. Extensive experiments across low-light enhancement, deraining, and dehazing show that DiffLoss improves naturalness and semantic perception, enabling lightweight restorers to achieve higher perceptual quality without extra inference cost.
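The objective above can be sketched in code. This is a minimal illustration of the loss structure only, not the authors' implementation: `unet` stands in for the frozen diffusion denoiser (assumed to return its noise prediction and h-space bottleneck feature), the default weights `lam` and `gamma` are placeholders, all distances are taken as MSE for simplicity, and comparing the restored output against the ground truth through the same forward-diffused noise is an assumed reading of the paper's projection step.

```python
import math
import random

def forward_diffuse(x0, t, alpha_bars, eps):
    """Closed-form forward step: x_t = sqrt(abar_t)*x0 + sqrt(1-abar_t)*eps."""
    a = alpha_bars[t]
    return [math.sqrt(a) * v + math.sqrt(1.0 - a) * e for v, e in zip(x0, eps)]

def mse(a, b):
    """Mean squared distance between two equal-length vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b)) / len(a)

def diffloss_total(restored, target, unet, alpha_bars, t, lam=0.5, gamma=0.1):
    """L_total = ||x - z|| + gamma * (L_nat + lam * L_sem), all terms as MSEs.

    `unet(x_t, t)` is the frozen diffusion model, assumed to return
    (predicted_noise, h_space_feature). `lam`/`gamma` are illustrative weights.
    """
    eps = [random.gauss(0.0, 1.0) for _ in restored]  # shared forward noise
    eps_r, h_r = unet(forward_diffuse(restored, t, alpha_bars, eps), t)
    eps_g, h_g = unet(forward_diffuse(target, t, alpha_bars, eps), t)
    l_nat = mse(eps_r, eps_g)  # naturalness: distance in the noise-prediction space
    l_sem = mse(h_r, h_g)      # semantics: distance between h-space bottleneck features
    return mse(restored, target) + gamma * (l_nat + lam * l_sem)
```

Because the diffusion model is frozen and only consulted at training time, the restoration network deployed at inference is unchanged, which is the source of the "no extra inference cost" claim.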

Abstract

Image restoration aims to enhance low quality images, producing high quality images that exhibit natural visual characteristics and fine semantic attributes. Recently, the diffusion model has emerged as a powerful technique for image generation, and it has been explicitly employed as a backbone in image restoration tasks, yielding excellent results. However, it suffers from the drawbacks of slow inference speed and large model parameters due to its intrinsic characteristics. In this paper, we introduce a new perspective that implicitly leverages the diffusion model to assist the training of image restoration network, called DiffLoss, which drives the restoration results to be optimized for naturalness and semantic-aware visual effect. To achieve this, we utilize the mode coverage capability of the diffusion model to approximate the distribution of natural images and explore its ability to capture image semantic attributes. On the one hand, we extract intermediate noise to leverage its modeling capability of the distribution of natural images, which serves as a naturalness-oriented optimization space. On the other hand, we utilize the bottleneck features of diffusion model to harness its semantic attributes serving as a constraint on semantic level. By combining these two designs, the overall loss function is able to improve the perceptual quality of image restoration, resulting in visually pleasing and semantically enhanced outcomes. To validate the effectiveness of our method, we conduct experiments on various common image restoration tasks and benchmarks. Extensive experimental results demonstrate that our approach enhances the visual quality and semantic perception of the restoration network.

Paper Structure

This paper contains 13 sections, 12 equations, 7 figures, and 4 tables.

Figures (7)

  • Figure 1: (a) Illustration of the effect of our method, converting a pixel-level constraint into a distribution-level constraint. (b) Visual comparison between baselines with and without our DiffLoss. The top row is produced with IAT [cui2022you] on the LOL dataset [wei2018deep] and the bottom with MSBDN [dong2020multi] on the NH-HAZE dataset [ancuti2020nh]. The previous loss is limited to the pixel level and suffers from unnaturalness, with color shift and content artifacts. Our DiffLoss leverages the diffusion model's powerful modeling of the natural-image distribution, yielding more natural restoration results. Note that DiffLoss is an optimization strategy: the improvement should be compared against the baseline method, not against other restoration methods or the ground-truth images.
  • Figure 2: As the h-space changes, the image gradually loses its original semantics, from ① to ④ in (a). We also use the output features of a ResNet50 network to measure the distance between high-level features of images with different types of degradation, showing the change in semantic attributes, as depicted in the histogram (b). As the h-space perturbation increases, clean images and images restored with DiffLoss exhibit systematic variations in semantic attributes, while degraded images and images restored without DiffLoss show minimal change. This means degradations undermine the semantic attributes of images, and our DiffLoss can restore them. Note that the diffusion model and ResNet50 are both trained on the ImageNet dataset.
  • Figure 3: Overview of our method. The parameters of DiffLoss are frozen during the training stage. Any existing restoration network can be trained with the aid of our DiffLoss to achieve better natural visual and semantic performance. More implementation details of DiffLoss can be found in Figure 4. During the inference stage, only the optimized restoration network is used; DiffLoss is not involved.
  • Figure 4: Detailed design of DiffLoss. We devise our DiffLoss with a $t$-step forward process and a one-step reverse process. The $t$-step forward process is computed directly in closed form with Eq. (7). The noisy images are then fed into the denoising UNet, which projects them into intermediate noise via Eq. (4). We also take the h-space vector from the bottleneck of the UNet, which contains semantic information, as described in Eqs. (12) and (13). DiffLoss is designed to pull the outputs of the denoising UNet and the bottleneck features closer together.
  • Figure 5: Comparison of visual results on the Rain100H and LOL datasets. (a) EfDeRain; (b) RCD-Net; (c) IAT; (d) DeepLPF. Please zoom in for the best view.
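The $t$-step forward process and one-step reverse process described in the Figure 4 caption are the standard DDPM relations, sketched below for a scalar pixel. The linear beta schedule with the common DDPM defaults is an assumption for illustration; the function names are hypothetical and the paper's own equation numbers (7, 4, 12, 13) refer to its full formulation.

```python
import math

def linear_betas(T, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule; the endpoint values are the common DDPM defaults."""
    return [beta_start + (beta_end - beta_start) * i / (T - 1) for i in range(T)]

def alpha_bar(betas, t):
    """Cumulative product  abar_t = prod_{s<=t} (1 - beta_s)."""
    p = 1.0
    for s in range(t + 1):
        p *= 1.0 - betas[s]
    return p

def q_sample(x0, t, betas, eps):
    """t-step forward process in one closed-form step:
    x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    a = alpha_bar(betas, t)
    return math.sqrt(a) * x0 + math.sqrt(1.0 - a) * eps

def estimate_x0(xt, t, betas, eps_pred):
    """One-step reverse estimate: invert q_sample given the predicted noise."""
    a = alpha_bar(betas, t)
    return (xt - math.sqrt(1.0 - a) * eps_pred) / math.sqrt(a)
```

With the true noise, `estimate_x0` recovers `x0` exactly; in DiffLoss the noise comes from the frozen denoising UNet, so the intermediate noise itself serves as the naturalness-oriented optimization space.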