UniRelight: Learning Joint Decomposition and Synthesis for Video Relighting

Kai He, Ruofan Liang, Jacob Munkberg, Jon Hasselgren, Nandita Vijaykumar, Alexander Keller, Sanja Fidler, Igor Gilitschenski, Zan Gojcic, Zian Wang

TL;DR

The paper tackles the challenge of relighting from a single image or video under limited multi-illumination data by introducing UniRelight, a joint intrinsic-illumination diffusion framework. It jointly denoises latent representations of input, albedo, and relit output, leveraging HDR lighting encodings and cross-modal attention within a video diffusion transformer. Trained on a hybrid dataset of synthetic multi-illumination scenes and auto-labeled real-world videos, it achieves superior visual fidelity and temporal consistency compared with state-of-the-art baselines, and supports illumination augmentation for practical applications. The work reduces error accumulation seen in two-stage pipelines by implicitly modeling scene properties, enabling robust relighting across diverse scenes and materials.

Abstract

We address the challenge of relighting a single image or video, a task that demands precise scene intrinsic understanding and high-quality light transport synthesis. Existing end-to-end relighting models are often limited by the scarcity of paired multi-illumination data, restricting their ability to generalize across diverse scenes. Conversely, two-stage pipelines that combine inverse and forward rendering can mitigate data requirements but are susceptible to error accumulation and often fail to produce realistic outputs under complex lighting conditions or with sophisticated materials. In this work, we introduce a general-purpose approach that jointly estimates albedo and synthesizes relit outputs in a single pass, harnessing the generative capabilities of video diffusion models. This joint formulation enhances implicit scene comprehension and facilitates the creation of realistic lighting effects and intricate material interactions, such as shadows, reflections, and transparency. Trained on synthetic multi-illumination data and extensive automatically labeled real-world videos, our model demonstrates strong generalization across diverse domains and surpasses previous methods in both visual fidelity and temporal consistency.

Paper Structure

This paper contains 25 sections, 1 equation, 10 figures, 6 tables.

Figures (10)

  • Figure 1: Given an input image (top left) or video, our method jointly estimates albedo (bottom left) and synthesizes relit videos with novel lighting conditions using provided HDR probes. Notably, our estimated albedo maps effectively demodulate shadows and specular highlights, while the relit images exhibit plausible shadows and specular highlights.
  • Figure 2: Method overview. Given an input video $\mathbf{I}$ and a target lighting configuration $(\mathbf{E}_{\text{ldr}}, \mathbf{E}_{\text{log}}, \mathbf{E}_{\text{dir}})$, our method jointly predicts a relit video $\hat{\mathbf{I}}_E$ and its corresponding albedo $\hat{\mathbf{a}}$. We use a pretrained VAE encoder-decoder pair $(\mathcal{E}, \mathcal{D})$ to map input and output videos to a latent space. The latents for the target relit video and albedo are concatenated along the temporal (frame) dimension with the encoded input video. Lighting features $\mathbf{h}^{\mathbf{E}}$, derived from the environment maps, are concatenated along the channel dimension with the relit video latent. A finetuned DiT video model denoises the joint latent according to Equation \ref{eq:denoiser}, enabling consistent generation of both relit appearance and intrinsic decomposition.
  • Figure 3: Qualitative comparison on the synthetic dataset and MIT multi-illumination dataset. Our method produces high-quality inter-reflections and shadows in synthetic scenes (top rows). Crucially, on the MIT multi-illumination dataset (bottom rows), it delivers relighting results with higher accuracy than baselines, which fail when faced with complex materials.
  • Figure 4: Qualitative comparison on in-the-wild data. Our method generates more plausible results than the baselines, with higher quality and more realistic appearance.
  • Figure 5: Ablation on joint modeling. Relighting results on urban street scenes. The orange and green crops highlight regions where the pure relighting model (w/o joint modeling) clearly bakes shadows from the input image into the relit result. Our joint model correctly demodulates the shadows.
  • ...and 5 more figures
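
The latent layout described in the Figure 2 caption can be illustrated with a small shape-level sketch. This is not the authors' implementation: the function name `build_joint_latent`, the (frames, channels, height, width) array layout, the ordering of the three streams, and the zero-padding of the input and albedo latents (used here so that all three streams share a channel count) are assumptions made for illustration. Only the two concatenation axes come from the caption.

```python
import numpy as np

def build_joint_latent(z_input, z_albedo, z_relit, h_light):
    """Assemble a joint latent from three video latents and lighting features.

    Hypothetical layout: video latents are (T, C, H, W) and the lighting
    features h_light are (T, C_E, H, W).
    """
    # From the caption: lighting features are concatenated along the
    # channel dimension with the relit video latent.
    relit = np.concatenate([z_relit, h_light], axis=1)

    # Assumption (not stated in the caption): zero-pad the input and albedo
    # latents so their channel count matches the relit stream.
    pad = np.zeros_like(h_light)
    inp = np.concatenate([z_input, pad], axis=1)
    alb = np.concatenate([z_albedo, pad], axis=1)

    # From the caption: the three latents are concatenated along the
    # temporal (frame) dimension. The ordering here is an assumption.
    return np.concatenate([inp, alb, relit], axis=0)

# Usage with made-up sizes: 4 latent frames, 16 latent channels,
# 3 lighting channels, 8x8 spatial resolution.
rng = np.random.default_rng(0)
T, C, C_E, H, W = 4, 16, 3, 8, 8
joint = build_joint_latent(
    rng.standard_normal((T, C, H, W)),    # encoded input video
    rng.standard_normal((T, C, H, W)),    # noisy albedo latent
    rng.standard_normal((T, C, H, W)),    # noisy relit latent
    rng.standard_normal((T, C_E, H, W)),  # lighting features
)
print(joint.shape)  # (12, 19, 8, 8)
```

Stacking along the frame axis lets the transformer attend across the input, albedo, and relit streams in a single sequence, which is one way to read the caption's "joint latent".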