Detail-Preserving Latent Diffusion for Stable Shadow Removal

Jiamin Xu, Yuxin Zheng, Zelong Li, Chi Wang, Renshu Gu, Weiwei Xu, Gang Xu

TL;DR

This work tackles the generalization gap in shadow removal by leveraging a pre-trained Stable Diffusion model through a two-stage pipeline. The first stage performs shadow removal in a fixed-VAE latent space by fine-tuning the denoiser, yielding strong global removal but potentially losing fine details. The second stage introduces a shadow-aware detail injection module that uses encoder features and DINO cues to inject shadow-free detail back into the decoder, preserving high-frequency texture. Across four datasets and cross-dataset evaluations, the method achieves state-of-the-art or competitive results, with notably improved generalization in unseen domains. The approach offers a practical, mask-free path to high-quality, detail-preserving shadow removal at high resolutions, with room for enhancement via unsupervised signals in future work.

Abstract

Achieving high-quality shadow removal with strong generalizability is challenging in scenes with complex global illumination. Due to the limited diversity in shadow removal datasets, current methods are prone to overfitting training data, often leading to reduced performance on unseen cases. To address this, we leverage the rich visual priors of a pre-trained Stable Diffusion (SD) model and propose a two-stage fine-tuning pipeline to adapt the SD model for stable and efficient shadow removal. In the first stage, we fix the VAE and fine-tune the denoiser in latent space, which yields substantial shadow removal but may lose some high-frequency details. To resolve this, we introduce a second stage, called the detail injection stage. This stage selectively extracts features from the VAE encoder to modulate the decoder, injecting fine details into the final results. Experimental results show that our method outperforms state-of-the-art shadow removal techniques. The cross-dataset evaluation further demonstrates that our method generalizes effectively to unseen data, enhancing the applicability of shadow removal methods.
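As a rough intuition for the two stages described above, the following toy sketch works on a 1-D signal. Everything in it is an assumption made for illustration: average pooling stands in for the VAE encoder, sample repetition stands in for the decoder, the stage-one output is simulated rather than produced by a diffusion model, and the hand-written `gate` stands in for the learned, selective injection. It is not the authors' implementation.

```python
# Toy 1-D analogy only. Average pooling plays the role of a lossy encoder and
# sample repetition the role of a decoder; none of this is the paper's code.

def encode(signal, factor=4):
    """Lossy 'encoder': average each block of `factor` samples."""
    return [sum(signal[i:i + factor]) / factor
            for i in range(0, len(signal), factor)]

def decode(latent, factor=4):
    """'Decoder': repeat each latent value `factor` times."""
    return [v for v in latent for _ in range(factor)]

def inject_detail(decoded, source, gate, factor=4):
    """Add the high-frequency residual of `source` to `decoded`, scaled by `gate`."""
    smooth = decode(encode(source, factor), factor)
    return [d + g * (s - m)
            for d, s, m, g in zip(decoded, source, smooth, gate)]

# Shadow-free signal: base level 10 with a fine +1/-1 texture.
clean = [10 + t for t in [1, -1] * 6]
# Hypothetical shadow: samples 2..5 are darkened by 6.
shadowed = [c - 6 if 2 <= i < 6 else c for i, c in enumerate(clean)]

# Stage-one stand-in: a shadow-free result that went through the lossy round
# trip, so the shadow is gone but the texture is lost as well.
stage_one = decode(encode(clean))

# Ungated injection copies shadow residue back from the shadowed input.
naive = inject_detail(stage_one, shadowed, gate=[1] * 12)

# Gated injection: off in the two blocks touched by the shadow, on elsewhere.
gate = [0] * 8 + [1] * 4
gated = inject_detail(stage_one, shadowed, gate)
```

With an all-ones gate, the residual taken from the shadowed input reintroduces the shadow. Switching the gate off near the shadow keeps the stage-one result there and restores texture elsewhere. In the paper this selectivity comes from a trained module rather than a fixed mask.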

Paper Structure

This paper contains 22 sections, 6 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Top: For complex shadows in indoor scenes, current methods struggle to completely remove shadows, such as those behind the eagle sculpture. Bottom: Our method effectively removes the shadows, and with our second stage, the details from the original input, such as the wooden texture shown in the cropped image, are preserved.
  • Figure 2: Our proposed network. We propose a two-stage shadow removal network based on Stable Diffusion (SD). (1) In the first stage, as shown in the bottom half, we fine-tune the pre-trained UNet in SD within the latent space defined by SD's pre-trained VAE ($\mathcal{E}$ and $\mathcal{D}$). We found that the pre-trained latent space can effectively represent shadow-free images. (2) In the second stage, as shown in the top half, we modulate the VAE decoder $\mathcal{D}$ by selectively adding features from the VAE encoder $\mathcal{E}$ using a Detail Injection Model (DIM). The model consists of multiple RRDB layers, which inject shadow-free texture details into the decoder features. With these two stages, our proposed network can generate high-quality, shadow-free images that preserve fine details.
  • Figure 3: Latent space in VAE. We apply the pre-trained encoding and decoding process on a shadow image $\mathbf{x}$ or a shadow-free image $\mathbf{y}$. For a shadow-free image, this process does not introduce additional shadows, meaning the pre-trained latent space can also represent the shadow-free image. However, as shown in the zoomed-in view, some details, like text, may be lost in this process.
  • Figure 4: Some details in the input image, such as the textures on the floor, change during latent space diffusion in stage one. In stage two, our method preserves these high-quality details while effectively removing shadows. As shown in the bottom-left image, where we map the RRDB features to three dimensions using PCA, we observe that shadow areas have distinct features (in green), indicating that our detail injection model is shadow-aware.
  • Figure 5: Comparisons with SOTA shadow removal methods on the ISTD+ and SRD datasets. The input mask, required by methods that are not mask-free, is shown in the top-left corner.
  • ...and 1 more figure
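
The PCA visualisation mentioned in the Figure 4 caption, where features are mapped to three dimensions, can be sketched as below. This is a minimal illustration under stated assumptions: the function name `pca_to_rgb`, the `(C, H, W)` feature layout, and the random features with an artificially shifted "shadow" patch are all hypothetical, and the paper does not specify how its projection was computed.

```python
import numpy as np

def pca_to_rgb(features):
    """Project a (C, H, W) feature map to an (H, W, 3) image in [0, 1] via PCA.

    Assumes C >= 3 and H * W >= 3.
    """
    c, h, w = features.shape
    flat = features.reshape(c, h * w).T            # (H*W, C): one row per pixel
    flat = flat - flat.mean(axis=0, keepdims=True)  # centre each channel
    # Rows of vt are the principal axes of the centred data.
    _, _, vt = np.linalg.svd(flat, full_matrices=False)
    proj = flat @ vt[:3].T                          # (H*W, 3) top-3 components
    # Rescale each component to [0, 1] so it can be shown as a colour channel.
    lo, hi = proj.min(axis=0), proj.max(axis=0)
    proj = (proj - lo) / np.maximum(hi - lo, 1e-12)
    return proj.reshape(h, w, 3)

# Hypothetical features: Gaussian noise with a shifted 4x4 "shadow" patch.
rng = np.random.default_rng(0)
feats = rng.normal(size=(16, 8, 8))
feats[:, 2:6, 2:6] += 3.0

rgb = pca_to_rgb(feats)
```

If a region has distinct features, as the caption reports for shadow areas, it shows up as a distinct colour in the projected image. In this synthetic example the shifted patch dominates the first principal component.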