Lux Post Facto: Learning Portrait Performance Relighting with Conditional Video Diffusion and a Hybrid Dataset

Yiqun Mei, Mingming He, Li Ma, Julien Philip, Wenqi Xian, David M George, Xueming Yu, Gabriel Dedic, Ahmet Levent Taşel, Ning Yu, Vishal M. Patel, Paul Debevec

TL;DR

Lux Post Facto tackles portrait video relighting by casting relighting as an HDR-conditioned, diffusion-based video generation problem. It introduces a two-stage pipeline (delighting and relighting) built on a pretrained video diffusion backbone, augmented with a novel lighting-embedding mechanism that encodes directional light as embeddings delivered via cross-attention. A hybrid dataset, combining static OLAT data and in-the-wild videos, enables robust relighting while preserving temporal coherence through an auxiliary appearance-copy task. The method achieves state-of-the-art photorealism and temporal stability on in-the-wild portraits and supports precise lighting control, with practical implications for post-production workflows. Limitations include occlusions, rotating HDR maps not expressible by the light stage, and offline inference requirements, pointing to future work in real-time performance and higher-resolution generation.
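As a rough illustration of what "encoding directional light as embeddings" could look like, the sketch below converts an equirectangular HDR map into a sequence of per-direction tokens that a cross-attention layer could attend to. The grid size, the log compression, the token layout (unit direction plus RGB), and the function name `hdr_to_light_tokens` are all assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def hdr_to_light_tokens(hdr, rows=8, cols=16):
    """Turn an equirectangular HDR map (H, W, 3) into (rows*cols, 6) tokens.

    Each token is [dx, dy, dz, r, g, b]: a unit direction plus the
    log-compressed average radiance of that patch of the map.
    Hypothetical layout, for illustration only.
    """
    h, w, _ = hdr.shape
    assert h % rows == 0 and w % cols == 0, "map must tile evenly"
    # Average-pool the map down to a coarse rows x cols grid.
    pooled = hdr.reshape(rows, h // rows, cols, w // cols, 3).mean(axis=(1, 3))
    # Cell-centre spherical angles: theta is polar (0 at top), phi is azimuth.
    theta = (np.arange(rows) + 0.5) / rows * np.pi
    phi = (np.arange(cols) + 0.5) / cols * 2.0 * np.pi
    theta, phi = np.meshgrid(theta, phi, indexing="ij")
    dirs = np.stack([np.sin(theta) * np.cos(phi),
                     np.cos(theta),
                     np.sin(theta) * np.sin(phi)], axis=-1)
    # log1p keeps very bright light sources in a manageable numeric range.
    radiance = np.log1p(pooled)
    return np.concatenate([dirs, radiance], axis=-1).reshape(rows * cols, 6)
```

In a real model these raw tokens would presumably pass through a learned projection before serving as keys and values in cross-attention; that step is omitted here.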

Abstract

Video portrait relighting remains challenging because the results need to be both photorealistic and temporally stable. This typically requires a strong model design that can capture complex facial reflections, as well as intensive training on a high-quality paired video dataset, such as dynamic one-light-at-a-time (OLAT) captures. In this work, we introduce Lux Post Facto, a novel portrait video relighting method that produces both photorealistic and temporally consistent lighting effects. On the model side, we design a new conditional video diffusion model built upon a state-of-the-art pre-trained video diffusion model, alongside a new lighting injection mechanism that enables precise control. This lets us leverage strong spatial and temporal generative capability to produce plausible solutions to the ill-posed relighting problem. Our technique uses a hybrid dataset consisting of static-expression OLAT data and in-the-wild portrait performance videos to jointly learn relighting and temporal modeling. This avoids the need to acquire paired video data under different lighting conditions. Our extensive experiments show that our model produces state-of-the-art results in terms of both photorealism and temporal consistency.
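For context on why OLAT data is useful for supervision: light transport is linear, so a subject captured under one light at a time can be rendered under a new environment by summing the OLAT images, each weighted by the environment's radiance in that light's direction. The sketch below shows this standard image-based relighting step. It assumes the per-light RGB weights have already been sampled from the HDR map; the function name and array layout are illustrative and not taken from the paper.

```python
import numpy as np

def relight_from_olat(olat_images, light_weights):
    """Relight via a weighted sum of OLAT images.

    olat_images: (n_lights, H, W, 3) linear-radiance images, one per light.
    light_weights: (n_lights, 3) RGB weight per light, e.g. the HDR map
        integrated over the solid angle that light covers (assumed given).
    """
    # Sum over lights; each light's image is tinted by its RGB weight.
    return np.einsum("nhwc,nc->hwc", olat_images, light_weights)
```

Images must be in linear radiance (not gamma-encoded) for the sum to be physically meaningful.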

Paper Structure

This paper contains 44 sections, 18 figures, 5 tables.

Figures (18)

  • Figure 1: Lux Post Facto offers portrait relighting as a simple post-production process. Users can edit the lighting of portrait images (first row) and videos (second row) with high fidelity using any HDR map. Our method is temporally stable and highly photorealistic.
  • Figure 2: Overview and model design of Lux Post Facto. To relight an input video, a delighting model predicts an albedo video (a), which is then relit by a relighting model (b). Both models share the same architecture (c), based on Stable Video Diffusion (SVD). We condition the SVD on the input video by concatenating input latents to the Gaussian noise. To support autoregressive prediction for long sequences, we replace the first $T$ frames with previous predictions, indicated with a binary mask concatenated to the input. The output lighting is controlled by an HDR map, converted to a light embedding fed to the U-Net through cross-attention layers. The VAE that encodes and decodes the latents is omitted for clarity.
  • Figure 3: Hybrid video dataset creation. We add synthetic camera motion to static OLAT images, and apply the image delighting model to in-the-wild videos, creating our hybrid training data.
  • Figure 4: Training with hybrid data. To train both the relighting and delighting models, we use the hybrid dataset. We train the models on two tasks simultaneously: HDR-based condition (or no condition) on the OLAT data (i.e. lighting-rich dataset $\mathcal{D}_{l}$) and reference-based appearance copy on both datasets (i.e. motion-rich dataset $\mathcal{D}_{m}$ and lighting-rich dataset $\mathcal{D}_{l}$).
  • Figure 5: Comparison against video relighting methods on in-the-wild portrait videos. For each sequence we show three input frames, with the target HDR map and a reference image rendered with the same HDR map and OLAT data, both shown as insets. Our method produces more faithful lighting effects and is robust to facial expression changes (first column) and head motion (last two columns).
  • ...and 13 more figures
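
The input conditioning described in the Figure 2 caption (input latents concatenated to the noise, plus a binary mask marking frames carried over from a previous chunk) can be sketched roughly as follows. This is a minimal NumPy illustration under stated assumptions: the function name `build_unet_input`, the channel ordering, and the choice to overwrite the noise of the first frames with the tail of the previous prediction are guesses made for this example, not the authors' implementation.

```python
import numpy as np

def build_unet_input(noise, input_latents, prev_pred=None, t_overlap=0):
    """Assemble a per-frame U-Net input along the channel axis.

    noise, input_latents: arrays of shape (frames, channels, h, w).
    prev_pred: latents predicted for the previous chunk, or None.
    t_overlap: number of leading frames to overwrite with prev_pred's tail.
    """
    frames, _, h, w = noise.shape
    noisy = noise.copy()
    mask = np.zeros((frames, 1, h, w), dtype=noise.dtype)
    if prev_pred is not None and t_overlap > 0:
        # Reuse the last t_overlap predicted frames as the first frames
        # of this chunk, and flag them in the binary mask.
        noisy[:t_overlap] = prev_pred[-t_overlap:]
        mask[:t_overlap] = 1.0
    # Channel layout (assumed): [noise or previous prediction | input | mask].
    return np.concatenate([noisy, input_latents, mask], axis=1)
```

For the first chunk of a long video, `prev_pred` is left as `None` and the mask is all zeros; later chunks pass the previous chunk's output so that overlapping frames stay consistent.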