UniRelight: Learning Joint Decomposition and Synthesis for Video Relighting
Kai He, Ruofan Liang, Jacob Munkberg, Jon Hasselgren, Nandita Vijaykumar, Alexander Keller, Sanja Fidler, Igor Gilitschenski, Zan Gojcic, Zian Wang
TL;DR
The paper tackles relighting a single image or video when paired multi-illumination data is scarce, and introduces UniRelight, a joint intrinsic-illumination diffusion framework. The model jointly denoises the latent representations of the input, the albedo, and the relit output, using HDR lighting encodings and cross-modal attention within a video diffusion transformer. Trained on a hybrid dataset of synthetic multi-illumination scenes and automatically labeled real-world videos, it achieves higher visual fidelity and temporal consistency than state-of-the-art baselines, and it supports illumination augmentation for practical applications. Because scene properties are modeled implicitly within a single pass, the approach reduces the error accumulation seen in two-stage pipelines and relights diverse scenes and materials robustly.
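To make the idea of joint denoising with cross-modal attention more concrete, here is a toy sketch in plain Python. It is not the paper's architecture: the function names (`joint_attention`, `joint_denoise_step`), the way the lighting encoding is added to the relit tokens, the absence of learned weights, and the simple interpolation update are all assumptions chosen for illustration. The only point it shows is that the input, albedo, and relit tokens sit in one shared sequence, so each stream can attend to the others during a single update.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def joint_attention(tokens):
    """Single-head self-attention over one shared token sequence.

    Assumption: no learned projections; queries, keys, and values are the
    tokens themselves, which keeps the sketch short.
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, tokens)) for i in range(d)])
    return out


def joint_denoise_step(input_lat, albedo_lat, relit_lat, light_enc, step_size=0.1):
    """One toy update of the two noisy streams (albedo and relit output).

    The input tokens act as clean conditioning and are not updated.
    Adding light_enc to the relit tokens is a hypothetical conditioning scheme.
    """
    n = len(input_lat)
    relit_cond = [[x + l for x, l in zip(tok, light_enc)] for tok in relit_lat]
    # All three modalities share one sequence, so attention is cross-modal.
    mixed = joint_attention(input_lat + albedo_lat + relit_cond)
    new_albedo = [
        [a + step_size * (m - a) for a, m in zip(tok, mtok)]
        for tok, mtok in zip(albedo_lat, mixed[n:2 * n])
    ]
    new_relit = [
        [r + step_size * (m - r) for r, m in zip(tok, mtok)]
        for tok, mtok in zip(relit_lat, mixed[2 * n:])
    ]
    return new_albedo, new_relit
```

In this sketch, changing the input tokens changes the albedo update as well as the relit update, which is the coupling that a two-stage pipeline with separate inverse and forward rendering steps does not have.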
Abstract
We address the challenge of relighting a single image or video, a task that demands precise scene intrinsic understanding and high-quality light transport synthesis. Existing end-to-end relighting models are often limited by the scarcity of paired multi-illumination data, restricting their ability to generalize across diverse scenes. Conversely, two-stage pipelines that combine inverse and forward rendering can mitigate data requirements but are susceptible to error accumulation and often fail to produce realistic outputs under complex lighting conditions or with sophisticated materials. In this work, we introduce a general-purpose approach that jointly estimates albedo and synthesizes relit outputs in a single pass, harnessing the generative capabilities of video diffusion models. This joint formulation enhances implicit scene comprehension and facilitates the creation of realistic lighting effects and intricate material interactions, such as shadows, reflections, and transparency. Trained on synthetic multi-illumination data and extensive automatically labeled real-world videos, our model demonstrates strong generalization across diverse domains and surpasses previous methods in both visual fidelity and temporal consistency.
