Controlling the Latent Diffusion Model for Generative Image Shadow Removal via Residual Generation

Xinjie Li, Yang Zhao, Dong Wang, Yuan Chen, Li Cao, Xiaoping Liu

TL;DR

This work tackles the challenge of removing complex shadows while preserving image content by leveraging a pre-trained latent diffusion model to generate and refine shadow residuals rather than recreating shadow-free images from scratch. It introduces a Noise-Residual Decomposition (NRD) and a shadow residual schedule that integrate with a ControlNet-inspired backbone, coupled with cross-timestep self-enhanced training using an EMA-based replica to augment data and reduce error accumulation. A detail-preserving decoder built on a frozen VQ-GAN with multi-scale skip connections and Zero-Deconv further protects high-frequency details during reconstruction. On ISTD+ and SRD benchmarks, the method achieves state-of-the-art perceptual quality (LPIPS, FID) while maintaining competitive pixel-level fidelity (PSNR/SSIM) and effectively preserves original content in shadow regions, demonstrating the practical utility of pre-trained diffusion priors for high-fidelity shadow removal.
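The EMA-based replica mentioned above can be pictured with a generic exponential-moving-average weight update. This is a minimal sketch of the general technique only, not the paper's training code: the function name `ema_update`, the dict-of-lists weight layout, and the decay value are all assumptions made for illustration.

```python
def ema_update(replica, online, decay=0.999):
    """Blend the online weights into the replica, in place.

    Generic EMA rule (illustrative, not taken from the paper):
        replica <- decay * replica + (1 - decay) * online
    Both arguments map a parameter name to a flat list of floats.
    """
    for name, weights in online.items():
        replica[name] = [
            decay * r + (1.0 - decay) * w
            for r, w in zip(replica[name], weights)
        ]
    return replica


# Hypothetical toy weights: the replica drifts slowly toward the online network.
replica = {"w": [0.0, 1.0]}
online = {"w": [1.0, 1.0]}
ema_update(replica, online, decay=0.9)
```

Because the replica changes more slowly than the network being trained, it is commonly used as a steadier source of intermediate predictions; how the paper feeds those predictions back into training is not specified in this summary.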

Abstract

Large-scale generative models have achieved remarkable advancements in various visual tasks, yet their application to shadow removal in images remains challenging. These models often generate diverse, realistic details without adequate focus on fidelity, failing to meet the crucial requirement of shadow removal: precise preservation of image content. In contrast to prior approaches that aimed to regenerate shadow-free images from scratch, this paper utilizes diffusion models to generate and refine image residuals. This strategy makes full use of the detailed information already present in shadowed images, resulting in a more efficient and faithful reconstruction of shadow-free content. Additionally, to prevent the accumulation of errors during the generation process, a cross-timestep self-enhancement training strategy is proposed. This strategy leverages the network itself to augment the training data, not only increasing the volume of data but also enabling the network to dynamically correct its generation trajectory, ensuring a more accurate and robust output. Finally, to address the loss of original details during the image encoding and decoding of large generative models, a content-preserved encoder-decoder structure is designed with a control mechanism and multi-scale skip connections to achieve high-fidelity shadow-free image reconstruction. Experimental results demonstrate that the proposed method can produce high-quality results based on a large latent diffusion prior and faithfully preserve the original contents in shadow regions.
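The motivation for generating residuals rather than whole images can be illustrated with a toy example. This is a sketch under stated assumptions, not the paper's formulation: it assumes the residual is defined as the shadow-free image minus the shadowed image (the paper's sign convention and latent-space details are not given here), and the helper `apply_residual` and the pixel values are hypothetical.

```python
def apply_residual(shadowed, residual):
    """Add a predicted residual to a shadowed image and clamp to [0, 1].

    Assumed convention (illustrative only): residual = shadow_free - shadowed.
    Images are nested lists of grayscale intensities in [0, 1].
    """
    return [
        [min(1.0, max(0.0, s + r)) for s, r in zip(s_row, r_row)]
        for s_row, r_row in zip(shadowed, residual)
    ]


# Hypothetical 1x2 image: the left pixel is lit, the right pixel is in shadow.
shadowed = [[0.75, 0.25]]
residual = [[0.0, 0.5]]  # zero outside the shadow, positive inside it
restored = apply_residual(shadowed, residual)
```

Where the residual is zero, the input pixel passes through unchanged, so content outside the shadow does not have to be regenerated at all.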

Paper Structure

This paper contains 16 sections, 15 equations, 10 figures, 4 tables.

Figures (10)

  • Figure 1: Current SOTA algorithms still cannot completely remove complex shadows. Owing to large-scale latent diffusion prior and the proposed residual generation diffusion, the proposed method can effectively remove shadows while faithfully preserving the image content.
  • Figure 2: Diffusion backward processes of different methods. (a) Denoising Diffusion Implicit Models (DDIM). (b) Residual Denoising Diffusion Models (RDDM). (c) the proposed residual generation model.
  • Figure 3: Flowchart of the training phase of the proposed method.
  • Figure 4: Flowchart of the inference (sampling) phase of the proposed method. The latent $z_t$ and shadow residual term $\left(\bar{\beta}_{t-1}-1\right) \cdot \hat{r}_t$ in Eq. (11) are also visualized for better understanding.
  • Figure 5: The schematic illustration of our training strategy.
  • ...and 5 more figures