Latent Feature-Guided Diffusion Models for Shadow Removal

Kangfu Mei, Luis Figueroa, Zhe Lin, Zhihong Ding, Scott Cohen, Vishal M. Patel

TL;DR

This work casts shadow removal as conditional image restoration with a diffusion model, conditioned on the shadow image and its mask. It introduces a learned latent feature space that captures shadow-free priors, trained with a two-stage strategy, and a Dense Latent Variable Fusion module intended to avoid local optima during training and improve texture fidelity. Results on AISTD, ISTD, SRD, and DESOBA are reported as state of the art, including a large gain on instance-level shadow removal. The approach offers a way to guide diffusion models with perceptual priors and may apply to other ill-posed low-level vision tasks.

Abstract

Recovering textures under shadows has remained a challenging problem due to the difficulty of inferring shadow-free scenes from shadow images. In this paper, we propose the use of diffusion models, as they offer a promising approach to gradually refine the details of shadow regions during the diffusion process. Our method improves this process by conditioning on a learned latent feature space that inherits the characteristics of shadow-free images, thus avoiding the limitation of conventional methods that condition on degraded images only. Additionally, we propose to alleviate potential local optima during training by fusing noise features with the diffusion network. We demonstrate the effectiveness of our approach, which outperforms the previous best method by 13% in terms of RMSE on the AISTD dataset. Further, we explore instance-level shadow removal, where our model outperforms the previous best method by 82% in terms of RMSE on the DESOBA dataset.

Paper Structure

This paper contains 18 sections, 19 equations, 20 figures, and 6 tables.

Figures (20)

  • Figure 1: Given a shadow mask, our method effectively removes shadows and recovers the underlying details for shadows at the general level (top two rows) or instance level (bottom two rows). From left to right, we show the input image, the shadow mask, the SG-ShadowNet [wan2022] result, our result, and the shadow-free image for comparison.
  • Figure 2: Our baseline method, which conditions diffusion models solely on shadow and mask images, produces incorrect results such as color mixing in highlight areas. In contrast, our proposed method generates results with consistent and reasonable colors that match the surrounding area.
  • Figure 3: The architecture of our diffusion model, shown for the backward (reverse) diffusion process. The latent feature encoder $\mathcal{E}_\theta(\cdot)$ takes the shadow image $\mathbf{x}\in\mathbb{R}^{3 \times H \times W}$ and the shadow mask $m \in \mathbb{R}^{1 \times H \times W}$ as input, where $H\times W$ is the image resolution, and produces a latent feature of compressed dimension $1 \times H \times W$. The diffusion network $\epsilon_\theta(\cdot)$, conditioned on $(\mathbf{x}, m)$, takes the latent feature concatenated with the noisy image $\mathbf{y}_t\in \mathbb{R}^{3 \times H \times W}$ as input and estimates the denoised image $\mathbf{y}_{t-1}\in \mathbb{R}^{3 \times H \times W}$ at each step of the diffusion process $p_\theta(\cdot)$. At each step, the noise encoder takes the noisy image $\mathbf{y}_t$ as input and produces a 1-D vector as the noise embedding, which is fused with the diffusion network features by modulation to help escape local optima.
  • Figure 4: The diagram illustrates the two-stage learning approach used in our proposed method. In the pretraining stage (top row), the diffusion network is trained on shadow-free images to learn a latent feature space that captures informative shadow-free priors as guidance. In the finetuning stage (bottom row), we initialize the diffusion network with the pretraining weights from (a) for shadow removal under the latent feature guidance.
  • Figure 5: Visual comparisons of different guidance strategies in the shadow removal literature. (a) to (d): shadow image, invariant color map [zhu_bijective_2022], coarse deshadowed image [wan2022], and our learned latent feature. Our approach provides more perceptual information than (b) and contains fewer shadow features than (c), which still retains a shadow boundary.
  • ...and 15 more figures
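
The data flow described in the Figure 3 caption can be traced at the level of tensor shapes. The following is a minimal NumPy sketch, not the authors' implementation: the toy sizes, the random weights, the helper names (`latent_encoder`, `noise_encoder`, `modulate`), channel-wise concatenation as the conditioning mechanism, and the scale-and-shift form of the modulation are all assumptions made for illustration. Only the input and output shapes follow the caption.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W = 8, 8            # toy resolution (assumption)
C_FEAT, D_EMB = 16, 32  # toy feature and embedding sizes (assumptions)

# Hypothetical random weights standing in for trained parameters.
w_latent = rng.standard_normal(4)
w_noise = rng.standard_normal((D_EMB, 3))
w_in = rng.standard_normal((C_FEAT, 8))
w_scale = rng.standard_normal((C_FEAT, D_EMB)) * 0.1
w_shift = rng.standard_normal((C_FEAT, D_EMB)) * 0.1


def latent_encoder(x, m):
    """Stand-in for E_theta: (3,H,W) image + (1,H,W) mask -> (1,H,W) latent."""
    xm = np.concatenate([x, m], axis=0)                    # (4, H, W)
    return np.tensordot(w_latent, xm, axes=(0, 0))[None]   # (1, H, W)


def noise_encoder(y_t):
    """Stand-in noise encoder: (3,H,W) noisy image -> 1-D embedding."""
    pooled = y_t.mean(axis=(1, 2))                         # (3,)
    return w_noise @ pooled                                # (D_EMB,)


def modulate(features, emb):
    """Per-channel scale-and-shift modulation (an assumed form of the fusion)."""
    scale = w_scale @ emb                                  # (C_FEAT,)
    shift = w_shift @ emb                                  # (C_FEAT,)
    return features * (1.0 + scale[:, None, None]) + shift[:, None, None]


x = rng.random((3, H, W))                                  # shadow image
m = (rng.random((1, H, W)) > 0.5).astype(np.float64)       # shadow mask
y_t = rng.standard_normal((3, H, W))                       # noisy image at step t

z = latent_encoder(x, m)                                   # latent feature guidance
# Assumed conditioning: concatenate noisy image, latent, image, and mask.
net_input = np.concatenate([y_t, z, x, m], axis=0)         # (8, H, W)
features = np.tensordot(w_in, net_input, axes=(1, 0))      # (C_FEAT, H, W)
emb = noise_encoder(y_t)
fused = modulate(features, emb)                            # same shape as features

print(z.shape, net_input.shape, emb.shape, fused.shape)
```

In this sketch a zero noise embedding leaves the features unchanged, so the modulation acts as a perturbation of the main path that depends on the current noisy image.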