ReLumix: Extending Image Relighting to Video via Video Diffusion Models

Lezhong Wang, Shutong Jin, Ruiqi Cui, Anders Bjorholm Dahl, Jeppe Revall Frisvad, Siavash Bigdeli

TL;DR

ReLumix addresses controllable lighting in video by decoupling relighting from temporal propagation. Any image-based relighting technique can be applied to video through a two-stage pipeline: relight a reference frame with a preferred method, then propagate that lighting across the remaining frames with a fine-tuned Stable Video Diffusion (SVD) model. Three components (embedding fusion, gated cross-attention, and temporal bootstrapping) enable mask-free, temporally coherent illumination transfer. The model is trained on synthetic data with strong sim-to-real generalization and achieves significant speedups over frame-inversion baselines. Results on the CARLA and DAVIS datasets show high fidelity and temporal stability, making the approach a flexible, scalable option for dynamic lighting control in practical video editing workflows.

Abstract

Controlling illumination during video post-production is a crucial yet elusive goal in computational photography. Existing methods often lack flexibility, restricting users to certain relighting models. This paper introduces ReLumix, a novel framework that decouples the relighting algorithm from temporal synthesis, thereby enabling any image relighting technique to be applied to video. Our approach reformulates video relighting into a simple yet effective two-stage process: (1) an artist relights a single reference frame using any preferred image-based technique (e.g., diffusion models, physics-based renderers); and (2) a fine-tuned Stable Video Diffusion (SVD) model propagates this target illumination throughout the sequence. To ensure temporal coherence and prevent artifacts, we introduce a gated cross-attention mechanism for smooth feature blending and a temporal bootstrapping strategy that harnesses SVD's powerful motion priors. Although trained on synthetic data, ReLumix shows competitive generalization to real-world videos. The method demonstrates significant improvements in visual fidelity, offering a scalable and versatile solution for dynamic lighting control.
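
To make the gated cross-attention idea in the abstract more concrete, here is a minimal NumPy sketch of one common form of gating: a residual cross-attention whose contribution is scaled by `tanh(gate)`. The abstract does not give the gating formula, so the tanh gate, the single attention head, and every name here (`gated_cross_attention`, `w_q`, `w_k`, `w_v`, `gate`) are assumptions for illustration, not the authors' implementation.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def gated_cross_attention(x, context, w_q, w_k, w_v, gate):
    """Single-head cross-attention with a scalar gate (illustrative only).

    x:       (N, d)    query tokens, e.g. video features
    context: (M, d_c)  key/value tokens, e.g. reference-frame embeddings
    w_q:     (d, d_h)  query projection
    w_k:     (d_c, d_h) key projection
    w_v:     (d_c, d)  value projection back to the query width
    gate:    scalar; tanh(gate) scales how much attended context is blended in
    """
    q = x @ w_q
    k = context @ w_k
    v = context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (N, M) scaled dot products
    attended = softmax(scores, axis=-1) @ v   # (N, d)
    # Assumed gating form: with gate == 0 the block is an identity mapping,
    # so the pretrained features pass through unchanged.
    return x + np.tanh(gate) * attended
```

With this particular gating choice, initializing `gate` at zero leaves the base model's behavior untouched at the start of fine-tuning, and the blend strength is then learned gradually.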

Paper Structure

This paper contains 23 sections, 9 equations, 7 figures, 5 tables, 1 algorithm.

Figures (7)

  • Figure 1: The user selects one video frame as the reference and edits it using any preferred relighting software or method. ReLumix then propagates these edits across the remaining frames to complete the video editing process. Our approach is trained on the CARLA synthetic dataset, enabling the model to learn the intrinsic representation of light and shadow even from basic data. This results in strong zero-shot capabilities, allowing the model to be applied to real-world videos without any fine-tuning.
  • Figure 2: Our proposed training method. Here, $V'_{\text{input}}$ represents the input video $V_{\text{input}}$ with the reference frame $\mathbf{R}$ replaced, $[;]$ denotes the concatenation operation, $\mathbf{\tilde{R}} \in \mathbb{R}^{T\times H\times W\times 3}$ represents the replicated reference frame, and $\delta'$ denotes the Denoising UNet in which the naive Cross Attention is replaced by the Gated Cross Attention module (see the Gated Cross Attention section). Note that the figure only includes the components we modified. Modules that SVD originally had, such as TimestepEmbedder, are not shown.
  • Figure 3: Examples of the synthetic dataset generated using the CARLA simulator.
  • Figure 4: Qualitative evaluation of different methods on synthetic and real-world datasets. Note that these methods use different input types: DiffusionRenderer uses an envmap for relighting, while Light-A-Video and IC-Light use text prompts. I2VEdit and our method use a relit first frame for relighting. To demonstrate our method's strong generalizability to real-world conditions, we specifically selected videos for relighting that feature lighting effects considerably distinct from those in the CARLA Relight dataset. This highlights our method's ability to perform well even with unseen lighting variations.
  • Figure 5: A qualitative comparison of our method and DiffusionRenderer using the same envmap. The envmap used is shown in the bottom-right corner of the leftmost image. As the CARLA simulator does not employ an envmap, the ground truth results are provided for reference only.
  • ...and 2 more figures
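
The tensors named in the Figure 2 caption can be sketched in NumPy as follows. The caption states that $V'_{\text{input}}$ is the input video with its reference frame replaced and that $\mathbf{\tilde{R}}$ is the reference frame replicated to shape $T\times H\times W\times 3$. It does not state which tensors are concatenated, along which axis, or whether this happens in pixel or latent space, so the channel-axis concatenation of the two tensors below is an assumption, as are the function name `build_conditioning` and the `ref_index` parameter.

```python
import numpy as np


def build_conditioning(video, relit_ref, ref_index=0):
    """Build V'_input, R~ and an assumed concatenation of the two.

    video:     (T, H, W, 3) input video V_input
    relit_ref: (H, W, 3)    relit reference frame R
    ref_index: index of the reference frame in the video (hypothetical
               parameter; defaults to the first frame)
    """
    T, H, W, C = video.shape
    if relit_ref.shape != (H, W, C):
        raise ValueError("relit_ref must match a single video frame")

    # V'_input: the input video with the reference frame swapped for R.
    v_prime = video.copy()
    v_prime[ref_index] = relit_ref

    # R~: the reference frame replicated T times -> (T, H, W, 3).
    r_tilde = np.broadcast_to(relit_ref, (T, H, W, C)).copy()

    # [;]: concatenation. Operands and axis are assumptions (channel axis).
    fused = np.concatenate([v_prime, r_tilde], axis=-1)  # (T, H, W, 6)
    return v_prime, r_tilde, fused
```

For a 4-frame video of 2×3 images, `fused` has shape `(4, 2, 3, 6)` under this channel-wise assumption, and the original `video` array is left unmodified.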