Physics-Grounded Shadow Generation from Monocular 3D Geometry Priors and Approximate Light Direction
Shilin Hu, Jingyi Xu, Akshat Dave, Dimitris Samaras, Hieu Le
TL;DR
This work tackles shadow synthesis from a single image by injecting the explicit physics of shadow formation, namely the coupling between geometry and illumination, into a diffusion-based framework. It uses monocular point maps (via MoGe-2) and a predicted 3D light direction to generate an initial, geometry-consistent shadow, which is then refined by a ControlNet-conditioned diffusion model. The approach combines a control encoder, a light predictor, and a mask predictor to produce physically grounded shadows and accurate shadow masks, with ablations confirming the benefit of each physics-driven cue. Evaluations on DESOBAV2 show state-of-the-art performance in both image fidelity and shadow localization across BOS and BOS-free settings, demonstrating the practical value of explicit physics in generative shadow synthesis.
Abstract
Shadow generation aims to produce photorealistic shadows that are visually consistent with object geometry and scene illumination. In the physics of shadow formation, an occluder blocks some of the light rays emitted by the light source that would otherwise reach a surface, creating a shadow that follows the silhouette of the occluder. However, such explicit physical modeling has rarely been used in deep-learning-based shadow generation. In this paper, we propose a novel framework that embeds explicit physical modeling of geometry and illumination into deep-learning-based shadow generation. First, given a monocular RGB image, we obtain approximate 3D geometry in the form of dense point maps and predict a single dominant light direction. These signals allow us to recover a reasonably accurate estimate of the shadow's location and shape based on the physics of shadow formation. We then integrate this physics-based initial estimate into a diffusion framework that refines the shadow into a realistic, high-fidelity appearance while ensuring consistency with scene geometry and illumination. Trained on DESOBAV2, our model produces shadows that are both visually realistic and physically coherent, outperforming existing approaches, especially in scenes with complex geometry or ambiguous lighting.
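To make the geometric idea concrete, the following is a minimal sketch of how occluder points and a single light direction determine where a shadow falls. It is an illustration under simplifying assumptions, not the procedure used in the paper: the receiving surface is assumed to be a plane n·x + d = 0, the light is assumed to be directional, and the function name `project_shadow_on_plane` and its parameters are hypothetical.

```python
import numpy as np

def project_shadow_on_plane(points, light_dir, plane_normal, plane_offset):
    """Project occluder points along the light direction onto a plane.

    Assumptions (illustrative only): directional light, planar receiver
    defined by plane_normal . x + plane_offset = 0.

    points:       (N, 3) array of occluder points, e.g. taken from a point map
    light_dir:    (3,) direction in which light travels (source -> scene)
    Returns an (M, 3) array of shadow points, M <= N.
    """
    points = np.asarray(points, dtype=float)
    l = np.asarray(light_dir, dtype=float)
    l = l / np.linalg.norm(l)              # unit light direction
    n = np.asarray(plane_normal, dtype=float)

    denom = n @ l
    if abs(denom) < 1e-8:
        # Light travels parallel to the plane: no intersection.
        return np.empty((0, 3))

    # Solve n . (p + t * l) + d = 0 for the ray parameter t of each point.
    t = -(points @ n + plane_offset) / denom

    # Keep only intersections that lie downstream of the occluder point.
    valid = t >= 0
    return points[valid] + t[valid, None] * l


# Usage: one point 1 unit above the ground plane z = 0, light at 45 degrees.
shadow = project_shadow_on_plane(
    points=[[0.0, 0.0, 1.0]],
    light_dir=[1.0, 0.0, -1.0],
    plane_normal=[0.0, 0.0, 1.0],
    plane_offset=0.0,
)
```

With a dense point map, the receiver is generally not a single plane, so a practical system would need to intersect the light rays with the estimated scene geometry instead; the sketch only shows why a light direction together with 3D occluder points is enough to constrain shadow location and shape.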
