DifFRelight: Diffusion-Based Facial Performance Relighting

Mingming He, Pascal Clausen, Ahmet Levent Taşel, Li Ma, Oliver Pilarski, Wenqi Xian, Laszlo Rikker, Xueming Yu, Ryan Burgert, Ning Yu, Paul Debevec

TL;DR

This work tackles the challenge of relighting free-viewpoint facial performances captured under a single flat lighting setup. It introduces a subject-specific diffusion-based relighting pipeline that uses paired flat-lit and OLAT data, with lighting encoded via Spherical Harmonics, and augments this with scalable dynamic 3D Gaussian Splatting to render novel viewpoints. Key contributions include a subject-specific diffusion model with spatial and global conditioning, a two-stage deformable 3DGS for long sequences, and a unified lighting framework that supports area-light and HDRI environment lighting. The approach delivers photorealistic relighting that preserves identity and fine details (skin, eyes, hair) and demonstrates real-world HDRI relighting, offering a practical pathway for postproduction relighting of flat-lit footage without extensive multi-light capture.

Abstract

We present a novel framework for free-viewpoint facial performance relighting using diffusion-based image-to-image translation. Leveraging a subject-specific dataset containing diverse facial expressions captured under various lighting conditions, including flat-lit and one-light-at-a-time (OLAT) scenarios, we train a diffusion model for precise lighting control, enabling high-fidelity relit facial images from flat-lit inputs. Our framework includes spatially-aligned conditioning of flat-lit captures and random noise, along with integrated lighting information for global control, utilizing prior knowledge from the pre-trained Stable Diffusion model. This model is then applied to dynamic facial performances captured in a consistent flat-lit environment and reconstructed for novel-view synthesis using a scalable dynamic 3D Gaussian Splatting method to maintain quality and consistency in the relit results. In addition, we introduce unified lighting control by integrating a novel area lighting representation with directional lighting, allowing for joint adjustments in light size and direction. We also enable high dynamic range imaging (HDRI) composition using multiple directional lights to produce dynamic sequences under complex lighting conditions. Our evaluations demonstrate the model's efficiency in achieving precise lighting control and generalizing across various facial expressions while preserving detailed features such as skin texture and hair. The model accurately reproduces complex lighting effects like eye reflections, subsurface scattering, self-shadowing, and translucency, advancing photorealism within our framework.
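
The abstract mentions HDRI composition from multiple directional lights without giving the procedure. One common way to do this, shown here as a minimal sketch rather than the authors' implementation, relies on the linearity of light transport: the environment map is reduced to one RGB weight per light direction, and the per-light relit images are summed with those weights. The function names, the equirectangular layout with +Y up, the nearest-direction binning, and the assumption that the relit images are in linear radiance are all illustrative choices not taken from the paper.

```python
import numpy as np


def equirect_directions(height, width):
    """Unit direction and solid angle for every pixel of an equirectangular map.

    Assumed convention (not from the paper): rows span the polar angle
    measured from +Y, columns span the azimuth.
    """
    theta = (np.arange(height) + 0.5) / height * np.pi      # polar angle
    phi = (np.arange(width) + 0.5) / width * 2.0 * np.pi    # azimuth
    theta, phi = np.meshgrid(theta, phi, indexing="ij")
    dirs = np.stack(
        [np.sin(theta) * np.cos(phi), np.cos(theta), np.sin(theta) * np.sin(phi)],
        axis=-1,
    )
    # Pixel footprint on the unit sphere: sin(theta) * dtheta * dphi.
    solid_angle = np.sin(theta) * (np.pi / height) * (2.0 * np.pi / width)
    return dirs, solid_angle


def hdri_light_weights(hdri, light_dirs):
    """Bin HDRI energy onto the nearest of N light directions.

    hdri: (H, W, 3) linear radiance; light_dirs: (N, 3) unit vectors.
    Returns an (N, 3) array of RGB weights, one row per directional light.
    """
    h, w, _ = hdri.shape
    dirs, solid_angle = equirect_directions(h, w)
    # Nearest light = largest cosine between pixel direction and light direction.
    nearest = np.argmax(dirs.reshape(-1, 3) @ light_dirs.T, axis=1)
    energy = (hdri * solid_angle[..., None]).reshape(-1, 3)
    weights = np.zeros((light_dirs.shape[0], 3))
    np.add.at(weights, nearest, energy)  # unbuffered scatter-add per light
    return weights


def compose_hdri(relit_images, weights):
    """Weighted sum of per-light relit images.

    relit_images: (N, H, W, 3), image n rendered under directional light n.
    """
    return np.einsum("nhwc,nc->hwc", relit_images, weights)
```

In this sketch, `relit_images` would hold one relit frame per directional light for the same flat-lit input. For a constant white environment map the weights sum to approximately 4π per channel, which is a convenient sanity check on the solid-angle term.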

Paper Structure

This paper contains 35 sections, 8 equations, 22 figures, 3 tables.

Figures (22)

  • Figure 1: Overview of the proposed relighting pipeline, including dynamic performance reconstruction and diffusion-based relighting. Starting with multi-view performance data of a subject in a neutral environment, we train a deformable 3DGS to create novel-view renderings of the dynamic sequence. These serve as inputs for a diffusion-based relighting model, trained on paired data to translate flat-lit input images to relit results based on specified lighting. Here, we show the inference step of the diffusion model, where the latent representation of the flat-lit image is concatenated with random noise as input for the diffusion U-Net. Lighting information, represented as a spherical harmonics (SH) encoding and combined with the text embedding, conditions the diffusion process.
  • Figure 2: Capture stage and data. (a) Outside and inside of the LED-panel-based volumetric capture (volcap) stage, with and without the subject. (b) Examples of our collected training data, including variation in expression ($\mathbf{E}$), lighting ($\mathbf{L}$), and viewpoint ($\mathbf{V}$).
  • Figure 3: Two-stage training of the scalable dynamic 3DGS. We first sample $K$ frames to partition the long sequence into segments $\{S_1, S_2, \ldots, S_k\}$. In Stage 1, we train the deformable 3DGS on the $K$ frames only, producing the initialization for the training of each segment in Stage 2. In Stage 2, we train a deformable 3DGS for each segment, conditioned on that initialization.
  • Figure 4: Visual comparison between using and not using pyramid noise during training. Using pyramid noise improves the color consistency between the prediction and GT. We increase the brightness of the images by 20% to exaggerate the error.
  • Figure 5: Ablation study on using pre-trained model weights. We ablate our method by randomly initializing the model weights instead of loading them from a pre-trained model (w/o pre-trained). Results show that loading from a pre-trained Stable Diffusion model helps generate photorealistic results with fewer artifacts.
  • ...and 17 more figures
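
The two-stage schedule described in the Figure 3 caption can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the caption does not say how the $K$ frames are chosen or how they map to segments, so the sketch assumes $K$ contiguous segments of near-equal length with the middle frame of each segment as its sampled frame. `partition_sequence`, `two_stage_train`, and the `train_fn(frames, init)` callback are hypothetical names.

```python
def partition_sequence(num_frames, k):
    """Split frame indices 0..num_frames-1 into k contiguous segments.

    Assumption: segments have near-equal length and the sampled frame of
    each segment is its middle frame. Returns (keyframes, segments).
    """
    if not 1 <= k <= num_frames:
        raise ValueError("k must be between 1 and num_frames")
    base, extra = divmod(num_frames, k)
    segments = []
    start = 0
    for i in range(k):
        # The first `extra` segments take one additional frame.
        length = base + (1 if i < extra else 0)
        segments.append(range(start, start + length))
        start += length
    keyframes = [seg[len(seg) // 2] for seg in segments]
    return keyframes, segments


def two_stage_train(num_frames, k, train_fn):
    """Run the two-stage schedule with a user-supplied training callback.

    train_fn(frames, init) is a hypothetical callback that trains a
    deformable 3DGS on the given frame indices, optionally starting from
    `init`, and returns the trained model.
    """
    keyframes, segments = partition_sequence(num_frames, k)
    # Stage 1: train on the sampled frames only to obtain a shared initialization.
    init = train_fn(frames=keyframes, init=None)
    # Stage 2: train one model per segment, each starting from that initialization.
    return [train_fn(frames=list(seg), init=init) for seg in segments]
```

Because every segment model starts from the same Stage 1 result, the Stage 2 runs in this sketch are independent of one another and could be executed in parallel.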