
VSDiffusion: Taming Ill-Posed Shadow Generation via Visibility-Constrained Diffusion

Jing Li, Jing Zhang

TL;DR

VSDiffusion is a visibility-constrained two-stage framework that narrows the solution space of shadow generation by incorporating visibility priors. It generates accurate shadows and establishes new SOTA results across most evaluation metrics.

Abstract

Generating realistic cast shadows for inserted foreground objects is a crucial yet challenging problem in image composition: keeping the shadow geometrically consistent with the object in complex scenes remains difficult because shadow formation is ill-posed. To address this issue, we propose VSDiffusion, a visibility-constrained two-stage framework designed to narrow the solution space by incorporating visibility priors. In Stage I, we predict a coarse shadow mask to localize plausible shadow regions. In Stage II, conditional diffusion guided by lighting and depth cues estimated from the composite generates accurate shadows. VSDiffusion injects visibility priors through two complementary pathways. First, a visibility control branch with shadow-gated cross attention provides multi-scale structural guidance. Second, a learned soft prior map reweights the training loss in error-prone regions to enhance geometric correction. We additionally introduce a high-frequency guided enhancement module to sharpen boundaries and improve texture interaction with the background. Experiments on the widely used public DESOBAv2 dataset demonstrate that VSDiffusion generates accurate shadows and establishes new SOTA results across most evaluation metrics.

Paper Structure (23 sections, 13 equations, 8 figures, 3 tables)

Figures (8)

  • Figure 1: (a) Shadow generation is an ill-posed problem: one input may correspond to multiple visually plausible shadows. With visibility analysis, geometric constraints from the light source, the caster, and the shadow receiver can significantly narrow the solution space. (b) Illustration of how the solution space is progressively narrowed. The hatched area indicates the intersection of $\Omega_{vis}$, $\Omega_{data}$, and $\Omega_{all}$. $S^*$ is the selected solution that balances data fidelity with geometric plausibility.
  • Figure 2: Framework of our VSDiffusion, which consists of two stages. Stage 1 predicts a coarse foreground shadow mask $M_{fs}^{(1)}$. Stage 2 includes two sub-modules: (a) the Visibility Control Branch (VCB), which uses a visibility prior estimator to extract $I_{\text{light}}$ and $I_{\text{depth}}$ from $I_\text{c}$; these are subsequently encoded by RCE to provide structural guidance. (b) Under this guidance, the U-Net progressively denoises $Z_t$ into $Z_0$.
  • Figure 3: Architecture of the Shadow-Gated Cross-Attention (SGCA) module. SGCA computes cross-attention between U-Net features $X$ (queries) and conditional features $C$ (keys/values). The Shadow Gate mechanism, consisting of dropout, convolution, and sigmoid, adaptively modulates the attention output before it is residually added to $X$, ensuring robust and selective integration of visibility priors.
  • Figure 4: Architecture of the High-Frequency Guided Enhancement (HFGE) module. HFGE consists of High-Frequency Extraction (blue) to capture structural cues from $F_e$, and High-Frequency Adaptation (green) to align and calibrate these signals via CBAM. The refined guidance is residually added to decoder features $F_d$ to enhance shadow boundaries and textures.
  • Figure 5: Generation of the spatial prior in SWL. Visibility-related inputs $\{I_{\text{light}}, I_{\text{depth}}, M_{\text{fo}}, M_{\text{fs}}^{(1)}\}$ are concatenated and fed into a lightweight predictor $G_p$, followed by a sigmoid, to obtain $S_{\text{prior}}$. Mean normalization then produces $\hat{S}_{\text{prior}}$ (with mean 1), which is used for spatial reweighting in SWL.
  • ...and 3 more figures
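
The mean normalization and spatial reweighting described in the Figure 5 caption can be illustrated with a minimal sketch. It is not the authors' implementation. The fixed `logits` grid is a hypothetical stand-in for the output of the predictor $G_p$, the helper names `mean_normalize` and `spatially_weighted_mse` are invented for illustration, and the per-pixel squared-error form of the loss is an assumption, since the excerpt does not specify the exact loss that SWL reweights.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def mean_normalize(prior, eps=1e-8):
    """Rescale a 2D map so that its spatial mean is (approximately) 1."""
    flat = [v for row in prior for v in row]
    mean = sum(flat) / len(flat)
    return [[v / (mean + eps) for v in row] for row in prior]


def spatially_weighted_mse(pred, target, weights):
    """Average of per-pixel squared errors, each scaled by its weight.

    Assumption: a squared-error loss is used here only for illustration.
    """
    total, count = 0.0, 0
    for p_row, t_row, w_row in zip(pred, target, weights):
        for p, t, w in zip(p_row, t_row, w_row):
            total += w * (p - t) ** 2
            count += 1
    return total / count


# Hypothetical 2x2 logits standing in for the output of G_p.
logits = [[-2.0, 0.0], [1.0, 3.0]]
s_prior = [[sigmoid(v) for v in row] for row in logits]  # S_prior in (0, 1)
s_hat = mean_normalize(s_prior)                          # normalized prior, mean 1

# With a uniform per-pixel error of 1, the weighted loss equals the mean weight.
pred = [[1.0, 1.0], [1.0, 1.0]]
target = [[0.0, 0.0], [0.0, 0.0]]
loss = spatially_weighted_mse(pred, target, s_hat)
```

Because the normalized map has mean 1, the reweighting redistributes emphasis across pixels without changing the overall loss scale when the error is uniform; pixels with a higher prior value contribute proportionally more.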