Semantic-guided Adversarial Diffusion Model for Self-supervised Shadow Removal
Ziqi Zeng, Chen Zhao, Weiling Cai, Chenyu Dong
TL;DR
This work tackles unsupervised shadow removal by combining semantic guidance with diffusion-based refinement in a two-stage framework. The coarse stage (SG-GAN) performs shadow generation and removal and constructs paired data through a cycle-consistent structure, while the refinement stage (DBRM) uses an IR-SDE diffusion process to restore texture and reduce edge artifacts. A general-purpose Multi-modal Semantic Prompter (MSP) uses CLIP image and text features to inject semantic priors, improving restoration quality on both real and synthetic data. On the ISTD and AISTD datasets, the method achieves results competitive with state-of-the-art unsupervised approaches and shows gains in texture fidelity and boundary smoothness, with ablation studies confirming the contribution of DBRM, MSP, and the individual losses. The approach offers a practical, self-supervised solution that reduces reliance on paired data and improves real-world shadow removal performance.
Abstract
Existing unsupervised methods have addressed the challenges of inconsistent paired data and the tedious acquisition of ground-truth labels in shadow removal. However, GAN-based training often suffers from mode collapse and unstable optimization. Furthermore, because the mapping between the shadow and shadow-free domains is complex, adversarial learning alone is not enough to capture the underlying relationship between the two domains, which results in low-quality generated images. To address these problems, we propose a two-stage semantic-guided adversarial diffusion framework for self-supervised shadow removal. In the first stage, a semantic-guided generative adversarial network (SG-GAN) produces a coarse result and constructs paired synthetic data through a cycle-consistent structure. In the second stage, a diffusion-based restoration module (DBRM) refines the coarse result to enhance texture details and reduce edge artifacts. We also propose a multi-modal semantic prompter (MSP) that helps extract accurate semantic information from real images and text, guiding the shadow removal network in SG-GAN toward better restoration. We conduct experiments on multiple public datasets, and the experimental results demonstrate the effectiveness of our method.
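As a rough illustration of how a cycle-consistent structure can yield paired synthetic data, the sketch below darkens a shadow-free image under a mask, restores it, and measures an L1 cycle loss. It is a minimal toy under stated assumptions, not the paper's implementation: `add_shadow` and `remove_shadow` are hypothetical closed-form stand-ins for the learned SG-GAN generators, and the fixed `strength` parameter and the choice of an L1 loss are assumptions made only for this example.

```python
from typing import List, Tuple

Image = List[List[float]]  # grayscale image with values in [0, 1]


def add_shadow(img: Image, mask: Image, strength: float = 0.5) -> Image:
    """Hypothetical stand-in for a shadow generator: darken pixels where mask is 1.

    Assumes 0 <= strength < 1 so the operation stays invertible.
    """
    return [[p * (1.0 - strength * m) for p, m in zip(row, mrow)]
            for row, mrow in zip(img, mask)]


def remove_shadow(img: Image, mask: Image, strength: float = 0.5) -> Image:
    """Hypothetical stand-in for a shadow remover: undo the darkening above."""
    return [[p / (1.0 - strength * m) for p, m in zip(row, mrow)]
            for row, mrow in zip(img, mask)]


def l1_loss(a: Image, b: Image) -> float:
    """Mean absolute difference between two images of the same shape."""
    diffs = [abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)


# A tiny shadow-free image and a binary shadow mask (both assumed for the demo).
shadow_free: Image = [[0.8, 0.6], [0.4, 0.2]]
mask: Image = [[1.0, 0.0], [0.0, 1.0]]

# Forward half of the cycle: synthesize a shadow image from the shadow-free one.
pseudo_shadow = add_shadow(shadow_free, mask)

# Backward half of the cycle: remove the synthetic shadow again.
reconstructed = remove_shadow(pseudo_shadow, mask)

# Cycle-consistency penalty: the reconstruction should match the input.
cycle_loss = l1_loss(shadow_free, reconstructed)

# The (synthetic shadow, shadow-free) tuple is a paired sample that a
# second-stage refinement model could be trained on.
pair: Tuple[Image, Image] = (pseudo_shadow, shadow_free)
```

In this toy the two functions are exact inverses, so the cycle loss is essentially zero; with learned generators the loss is instead a training signal that pushes the two mappings toward being mutually consistent.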
