Physics-Grounded Shadow Generation from Monocular 3D Geometry Priors and Approximate Light Direction

Shilin Hu, Jingyi Xu, Akshat Dave, Dimitris Samaras, Hieu Le

TL;DR

This work tackles shadow synthesis from a single image by injecting the explicit physics of shadow formation, the coupling between geometry and illumination, into a diffusion-based framework. It uses monocular point maps (via MoGe-2) and a predicted 3D light direction to generate an initial, geometry-consistent shadow estimate, which a ControlNet-conditioned diffusion model then refines. The approach combines a control encoder, a light predictor, and a mask predictor to produce physically grounded shadows and accurate shadow masks, and ablations confirm the benefit of each physics-driven cue. Evaluations on DESOBAV2 show state-of-the-art performance in both image fidelity and shadow localization across BOS and BOS-free settings, demonstrating the practical value of explicit physics in generative shadow synthesis.

Abstract

Shadow generation aims to produce photorealistic shadows that are visually consistent with object geometry and scene illumination. In the physics of shadow formation, an occluder blocks some of the light rays cast from the light source that would otherwise reach the surface, creating a shadow that follows the silhouette of the occluder. However, such explicit physical modeling has rarely been used in deep-learning-based shadow generation. In this paper, we propose a novel framework that embeds explicit physical modeling (geometry and illumination) into deep-learning-based shadow generation. First, given a monocular RGB image, we obtain approximate 3D geometry in the form of dense point maps and predict a single dominant light direction. These signals allow us to recover fairly accurate shadow location and shape based on the physics of shadow formation. We then integrate this physics-based initial estimate into a diffusion framework that refines the shadow into a realistic, high-fidelity appearance while ensuring consistency with scene geometry and illumination. Trained on DESOBAV2, our model produces shadows that are both visually realistic and physically coherent, outperforming existing approaches, especially in scenes with complex geometry or ambiguous lighting.
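
To make the physics step concrete, the following is a minimal sketch of how a shadow estimate could be derived from a point map, a foreground mask, and one light direction. It is not the authors' implementation: the function name `estimate_shadow`, the `radius` tolerance, the brute-force ray test, and the convention that the light vector points from the surface toward the light are all assumptions made for illustration. A receiver point is marked as shadowed when the ray from it toward the light passes within `radius` of some occluder point.

```python
import numpy as np

def estimate_shadow(point_map, fg_mask, light_dir, radius=0.05):
    """Brute-force shadow estimate (illustrative sketch, not the paper's code).

    point_map: (H, W, 3) array of per-pixel 3D points.
    fg_mask:   (H, W) boolean array, True on the occluding object.
    light_dir: 3-vector pointing from the surface toward the light (assumed convention).
    radius:    hypothetical tolerance for a ray to count as blocked.
    """
    l = np.asarray(light_dir, dtype=np.float64)
    l = l / np.linalg.norm(l)

    occ = point_map[fg_mask].astype(np.float64)    # (No, 3) occluder points
    rec_idx = ~fg_mask
    rec = point_map[rec_idx].astype(np.float64)    # (Nr, 3) receiver points

    # t[i, j]: signed distance along the light ray from receiver i
    # to its closest approach to occluder point j.
    t = (occ @ l)[None, :] - (rec @ l)[:, None]

    # Squared receiver-to-occluder distances via |p|^2 + |q|^2 - 2 p.q.
    sq = (rec ** 2).sum(1)[:, None] + (occ ** 2).sum(1)[None, :] - 2.0 * rec @ occ.T

    # Squared perpendicular distance from each occluder point to each ray.
    perp_sq = np.clip(sq - t ** 2, 0.0, None)

    # Blocked if the occluder lies toward the light and close to the ray.
    hit = (t > 0) & (perp_sq < radius ** 2)

    shadow = np.zeros(fg_mask.shape, dtype=bool)
    shadow[rec_idx] = hit.any(axis=1)
    return shadow
```

The pairwise test costs memory proportional to the number of receiver pixels times the number of occluder pixels, so a sketch like this is only practical at low resolution; a real system would more likely use a renderer or an acceleration structure.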

Paper Structure

This paper contains 14 sections, 14 equations, 9 figures, 3 tables.

Figures (9)

  • Figure 1: Shadow estimates from approximate geometry and light direction. Given a monocular RGB image and foreground mask, we recover an approximate 3D point map and a single dominant light direction to infer a shadow estimate using object points.
  • Figure 2: Comparison with state-of-the-art shadow generation methods. Our physics-grounded approach produces shadows that more faithfully align with the occluder geometry and scene lighting than state-of-the-art methods, SGDiffusion [liu2024shadow] and GPSD [zhao2025shadow]. We evaluate on both scenes with reference object-shadow pairs in the background (BOS) and BOS-free scenes.
  • Figure 3: Framework overview. We inject monocular geometry by stacking the shadow-free image, foreground object mask, and dense point map as a control signal. This feeds a control encoder and a light predictor; the predictor outputs a 3D light vector to render a shadow estimate. A mask predictor fuses the shadow estimate with diffusion features to predict the shadow mask in a coarse-to-fine scheme. The denoising U-Net and the intensity encoder are frozen during training.
  • Figure 4: Mask Predictor. From the foreground mask $M_{fo}$, point map $\mathbf{P}$, and predicted light $\hat{\mathbf{l}}$, we form an illumination-consistent shadow estimate at $64\times64$. The coarse stage stacks this estimate with denoising U-Net decoder features, upsamples to $512\times512$, and the fine stage concatenates the coarse map with $M_{fo}$, $\mathbf{P}$, and a broadcast light map, then applies a residual refinement for the final mask.
  • Figure 5: Qualitative comparison with SOTA. Visual results in both BOS (with background reference object–shadow pairs) and BOS-free (single object–shadow pair) settings. We compare generated images and predicted shadow masks against SGDiffusion [liu2024shadow], GPSD [zhao2025shadow], and ground truth. Our method consistently produces higher image fidelity and more accurate shadow masks that better respect occluder–receiver–illumination relationships.
  • ...and 4 more figures
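
The tensor assembly described in the captions of Figures 3 and 4 can be sketched as follows. This is an illustrative NumPy sketch only: the channels-first layout, the channel order, the channel counts (3 for RGB, 1 for the mask, 3 for the point map), and the helper names `broadcast_light_map`, `build_control_signal`, and `build_fine_stage_input` are assumptions, not details confirmed by the paper.

```python
import numpy as np

def broadcast_light_map(light_dir, height, width):
    """Repeat one 3D light vector at every pixel, giving a (3, H, W) map."""
    l = np.asarray(light_dir, dtype=np.float32).reshape(3, 1, 1)
    return np.broadcast_to(l, (3, height, width)).copy()

def build_control_signal(image, fg_mask, point_map):
    """Stack shadow-free image, foreground mask, and point map (Figure 3).

    image: (3, H, W), fg_mask: (1, H, W), point_map: (3, H, W).
    Returns a (7, H, W) array under the assumed channel counts.
    """
    return np.concatenate([image, fg_mask, point_map], axis=0)

def build_fine_stage_input(coarse_mask, fg_mask, point_map, light_dir):
    """Concatenate coarse map, mask, point map, and light map (Figure 4).

    coarse_mask: (1, H, W). Returns an (8, H, W) array under the
    assumed channel counts.
    """
    _, h, w = coarse_mask.shape
    light_map = broadcast_light_map(light_dir, h, w)
    return np.concatenate([coarse_mask, fg_mask, point_map, light_map], axis=0)
```

Broadcasting the light vector to a per-pixel map is a common way to let convolutional layers consume a global quantity alongside spatial inputs; whether the paper's fine stage uses exactly this layout is not stated in the captions.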