DiFaReli++: Diffusion Face Relighting with Consistent Cast Shadows

Puntawat Ponglertnapakorn, Nontawat Tritrong, Supasorn Suwajanakorn

TL;DR

DiFaReli++ tackles single-view face relighting in the wild without explicit intrinsic decomposition: it learns a diffusion-based relighting model conditioned on explicit, interpretable light-related encodings. Its conditioning scheme for a conditional DDIM combines spatial conditioning, in which a rendered shading reference is passed through a Modulator, with non-spatial conditioning via AdaGN, so that lighting is separated from identity and geometry. The key additions in DiFaReli++ are (i) cast-shadow-consistent relighting via shadow-map conditioning, (ii) relighting of non-facial parts using segmentation masks, and (iii) a distilled single-shot network (DiFaReli++_ss) trained on synthetic relit pairs, which runs in about 0.07 seconds per image, orders of magnitude faster than the teacher, and which the abstract reports as outperforming the teacher across all metrics. Empirically, the method achieves state-of-the-art performance on Multi-PIE, produces temporally consistent cast shadows under moving lights, and ranks highest in user studies. Its acknowledged limitations are tied to the quality of lighting estimation and to the handling of external shadows. Distillation offers a practical path toward real-time relighting, with potential extensions to full-body scenes and HDR lighting. Training uses the simple diffusion objective $L_{\text{simple}}$ over $T=1000$ steps; relighting manipulates a disentangled light code $\mathbf{l}$ alongside a target shading reference and a shadow map, which allows controlled lighting changes without explicit 3D ground truth.
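
For readers unfamiliar with the objective named above, the following is the standard form of the simple diffusion loss from the DDPM literature. It is offered as a reference sketch only: the symbol $\mathbf{c}$ is shorthand introduced here for all conditioning signals (light code, shading reference, shadow map, and the other encodings), and the exact variant used in the paper may differ.

```latex
% Forward noising of a clean image x_0 at step t, with cumulative schedule \bar{\alpha}_t
x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,
\qquad \epsilon \sim \mathcal{N}(0, \mathbf{I})

% Noise-prediction objective, with t drawn uniformly from \{1, \dots, T\}, T = 1000
L_{\text{simple}} = \mathbb{E}_{x_0,\, \epsilon,\, t}
\left[ \left\lVert \epsilon_\theta(x_t, t, \mathbf{c}) - \epsilon \right\rVert_2^2 \right]
```

The network $\epsilon_\theta$ is trained to predict the noise that was added, so no relit pairs or lighting ground truth enter the loss.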

Abstract

We introduce a novel approach to single-view face relighting in the wild, addressing challenges such as global illumination and cast shadows. A common scheme in recent methods involves intrinsically decomposing an input image into 3D shape, albedo, and lighting, then recomposing it with the target lighting. However, estimating these components is error-prone and requires many training examples with ground-truth lighting to generalize well. Our work bypasses the need for accurate intrinsic estimation and can be trained solely on 2D images without any light stage data, relit pairs, multi-view images, or lighting ground truth. Our key idea is to leverage a conditional diffusion implicit model (DDIM) for decoding a disentangled light encoding along with other encodings related to 3D shape and facial identity inferred from off-the-shelf estimators. We propose a novel conditioning technique that simplifies modeling the complex interaction between light and geometry. It uses a rendered shading reference along with a shadow map, inferred using a simple and effective technique, to spatially modulate the DDIM. Moreover, we propose a single-shot relighting framework that requires just one network pass, given pre-processed data, and even outperforms the teacher model across all metrics. Our method realistically relights in-the-wild images with temporally consistent cast shadows under varying lighting conditions. We achieve state-of-the-art performance on the standard benchmark Multi-PIE and rank highest in user studies.
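
To make the role of the DDIM concrete, here is a minimal NumPy sketch of one deterministic DDIM update (the $\eta = 0$ case). It is not the authors' implementation: the function name `ddim_step` and its arguments are hypothetical, and `eps_pred` stands in for the output of the conditional noise-prediction network.

```python
import numpy as np

def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update (eta = 0) from noise level t to the previous one.

    x_t            : current noisy image, any array shape
    eps_pred       : predicted noise for x_t (hypothetical network output)
    alpha_bar_t    : cumulative noise-schedule value at the current step, in (0, 1]
    alpha_bar_prev : cumulative noise-schedule value at the previous step, in (0, 1]
    """
    # Estimate the clean image implied by the current noise prediction.
    x0_pred = (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)
    # Move that estimate to the previous noise level along the same noise direction.
    return np.sqrt(alpha_bar_prev) * x0_pred + np.sqrt(1.0 - alpha_bar_prev) * eps_pred
```

Because the update is deterministic, running it in the reverse direction maps an input image to a noise map that reconstructs it, and decoding that map under a modified light encoding is what yields a relit image in this family of methods.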

Paper Structure (44 sections, 8 equations, 37 figures, 5 tables)

Figures (37)

  • Figure 1: Our method addresses one of the most challenging relighting scenarios where input images contain strong highlights and cast shadows. It effectively removes these effects and generates convincing shading and temporally consistent new shadows, all in a single network pass, given pre-processed input data. It also works across varying head poses, identities, and facial makeup.
  • Figure 2: Overview of DiFaReli++. We use off-the-shelf estimators to derive various encodings from the input image: segmentation masks, shadow map, (light, shape, camera) parameters, and face embedding. These encodings are then fed into a conditional DDIM via "spatial" and "non-spatial" conditioning techniques. For spatial conditioning, a shading reference, shadow map, and segmentation masks are concatenated and fed into the Modulator to produce spatial modulation weights for the first half of the DDIM. Meanwhile, the 3D shape, camera, and face embedding are concatenated and processed by a set of MLPs, which modulate the DDIM using a modified version of adaptive group normalization (AdaGN). For DiFaReli, please see Figure \ref{fig:pipelinecompare}.
  • Figure 3: Computing the shadow map for training. We use a pretrained DiFaReli model to generate versions of the input image with stronger and reduced shadows, then identify shadow areas from their pixel differences. This process produces more accurate and better spatially aligned shadow maps than the ray-traced maps shown in red, which suffer from inaccurate lighting and geometry estimation.
  • Figure 4: Modifications of the Modulator’s input in DiFaReli++. The input is a concatenation of the shadow map, the shading reference, and segmentation masks (see all masks in Figure \ref{fig_app:seg_mask}). This modification allows generation of consistent cast shadows and enables relighting of non-facial parts.
  • Figure 5: The single-shot face relighting framework involves (a) using DiFaReli++ to generate supervised relit pairs and (b) training a single-shot relighting network, with the same architecture as DiFaReli++, on these pairs with a simple L2 loss.
  • ...and 32 more figures
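
The Figure 3 caption describes deriving a shadow map from pixel differences between two renditions of the same image. The NumPy sketch below illustrates that idea only. The function name `shadow_map_from_pair`, the channel averaging, the `threshold` value, and the peak normalization are all assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def shadow_map_from_pair(strong, reduced, threshold=0.05):
    """Soft shadow map in [0, 1] from two renditions of one image.

    strong, reduced : float arrays of shape (H, W, 3) with values in [0, 1];
                      hypothetical stand-ins for model outputs with stronger
                      and reduced shadows.
    threshold       : assumed cutoff below which differences are treated as noise.
    """
    # Shadowed pixels are darker in the strong-shadow version, so this is positive there.
    diff = (reduced - strong).mean(axis=-1)
    # Discard pixels that became brighter; they are not shadow evidence.
    diff = np.clip(diff, 0.0, None)
    # Suppress small differences (assumed noise floor).
    diff[diff < threshold] = 0.0
    # Normalize by the peak so the strongest shadow maps to 1.
    peak = diff.max()
    return diff / peak if peak > 0 else diff
```

A map built this way is aligned with the image pixels by construction, which is consistent with the caption's point that it avoids the misalignment of ray-traced maps computed from estimated geometry and lighting.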