DiFaReli++: Diffusion Face Relighting with Consistent Cast Shadows
Puntawat Ponglertnapakorn, Nontawat Tritrong, Supasorn Suwajanakorn
TL;DR
DiFaReli++ tackles single-view face relighting in the wild without explicit intrinsic decomposition. Instead, it learns a diffusion-based relighting model conditioned on interpretable light-related encodings. Its conditioning scheme for a conditional DDIM renders a shading reference and combines spatial conditioning (via a Modulator) with non-spatial conditioning (via AdaGN) to separate lighting from identity and geometry. The key additions in DiFaReli++ are (i) cast-shadow-consistent relighting via shadow-map conditioning, (ii) relighting of non-facial regions using segmentation masks, and (iii) a distilled single-shot network (DiFaReli++_ss), trained on synthetic pairs, that outperforms its teacher across all metrics while running orders of magnitude faster, at about 0.07 seconds per image. Empirically, the method achieves state-of-the-art performance on Multi-PIE, produces temporally consistent cast shadows under moving lights, and ranks highest in user studies. Its limitations stem from the quality of lighting estimation and the handling of external shadows. The framework relights in-the-wild images photorealistically, offers a practical path to real-time relighting through distillation, and could be extended to full-body scenes and HDR lighting. Mathematically, the model is trained with the simple diffusion objective $L_{\text{simple}}$ over $T=1000$ steps, and relighting is performed by manipulating a disentangled light code $\mathbf{l}$ together with a target shading reference and a shadow map, which allows controlled lighting changes without any 3D ground truth.
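To make the training objective mentioned above concrete, the following is a minimal sketch of a one-sample estimate of the standard noise-prediction loss with $T=1000$ steps. It is not the authors' implementation. The linear beta schedule and its endpoints are assumptions borrowed from common DDPM practice, and `predict_eps` and `cond` are hypothetical stand-ins for the conditional network and its inputs (light code, shading reference, shadow map).

```python
import math
import random

T = 1000  # number of diffusion steps, as stated in the text


def linear_alpha_bar(num_steps=T, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule.

    The schedule and its endpoints are assumptions, not taken from the paper.
    """
    alpha_bar, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar


def add_noise(x0, eps, a_bar_t):
    """Forward process: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps."""
    s, n = math.sqrt(a_bar_t), math.sqrt(1.0 - a_bar_t)
    return [s * x + n * e for x, e in zip(x0, eps)]


def l_simple(predict_eps, x0, cond, alpha_bar, rng):
    """One-sample Monte Carlo estimate of the noise-prediction loss.

    predict_eps(x_t, t, cond) is a hypothetical conditional noise predictor.
    """
    t = rng.randrange(len(alpha_bar))            # uniformly sampled timestep
    eps = [rng.gauss(0.0, 1.0) for _ in x0]      # Gaussian noise target
    x_t = add_noise(x0, eps, alpha_bar[t])       # noised input at step t
    eps_hat = predict_eps(x_t, t, cond)          # network's noise estimate
    return sum((p - e) ** 2 for p, e in zip(eps_hat, eps)) / len(x0)
```

In practice `x0` would be an image tensor and `predict_eps` a neural network; plain lists are used here only to keep the sketch dependency-free.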
Abstract
We introduce a novel approach to single-view face relighting in the wild, addressing challenges such as global illumination and cast shadows. A common scheme in recent methods involves intrinsically decomposing an input image into 3D shape, albedo, and lighting, then recomposing it with the target lighting. However, estimating these components is error-prone and requires many training examples with ground-truth lighting to generalize well. Our work bypasses the need for accurate intrinsic estimation and can be trained solely on 2D images without any light stage data, relit pairs, multi-view images, or lighting ground truth. Our key idea is to leverage a conditional diffusion implicit model (DDIM) for decoding a disentangled light encoding along with other encodings related to 3D shape and facial identity inferred from off-the-shelf estimators. We propose a novel conditioning technique that simplifies modeling the complex interaction between light and geometry. It uses a rendered shading reference along with a shadow map, inferred using a simple and effective technique, to spatially modulate the DDIM. Moreover, we propose a single-shot relighting framework that requires just one network pass, given pre-processed data, and even outperforms the teacher model across all metrics. Our method realistically relights in-the-wild images with temporally consistent cast shadows under varying lighting conditions. We achieve state-of-the-art performance on the standard benchmark Multi-PIE and rank highest in user studies.
