RelightVid: Temporal-Consistent Diffusion Model for Video Relighting

Ye Fang, Zeyi Sun, Shangzhan Zhang, Tong Wu, Yinghao Xu, Pan Zhang, Jiaqi Wang, Gordon Wetzstein, Dahua Lin

TL;DR

RelightVid addresses the challenge of temporally coherent video relighting under multi-modal conditions by lifting a pre-trained image relighting diffusion model to video with a 3D U-Net and temporal attention. It introduces LightAtlas, a large data pipeline combining in-the-wild videos and 3D-rendered data to learn robust illumination priors, and employs multi-modal conditioning including background video, text, and HDR environment maps. A novel Illumination-Invariant Ensemble stabilizes relighting under varying illumination, while joint training integrates background and text cues for coherent edits. Empirical results show improved temporal consistency and lighting fidelity across background-, text-, and HDR-conditioned scenarios, indicating strong practical potential for film, games, and AR applications.

Abstract

Diffusion models have demonstrated remarkable success in image generation and editing, with recent advancements enabling albedo-preserving image relighting. However, applying these models to video relighting remains challenging due to the lack of paired video relighting datasets and the high demands for output fidelity and temporal consistency, further complicated by the inherent randomness of diffusion models. To address these challenges, we introduce RelightVid, a flexible framework for video relighting that can accept background video, text prompts, or environment maps as relighting conditions. Trained on in-the-wild videos with carefully designed illumination augmentations and rendered videos under extreme dynamic lighting, RelightVid achieves arbitrary video relighting with high temporal consistency without intrinsic decomposition while preserving the illumination priors of its image backbone.
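The temporal attention mentioned in the TL;DR can be illustrated with a minimal NumPy sketch. This is not RelightVid's implementation. The function name `temporal_self_attention`, the single-head layout, the weight shapes, and the zero-initialized output projection are all assumptions made for illustration. The sketch only shows the general idea of attending across frames at each spatial location, so that per-frame features can exchange information over time.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def temporal_self_attention(feats, w_q, w_k, w_v, w_out):
    """Hypothetical single-head attention over the time axis.

    feats: (T, H, W, C) per-frame feature maps.
    w_q, w_k, w_v, w_out: (C, C) projection matrices (illustrative shapes).
    Returns an array of the same shape as feats.
    """
    t, h, w, c = feats.shape
    # Treat every spatial location as an independent sequence of length T.
    seq = feats.reshape(t, h * w, c).transpose(1, 0, 2)  # (H*W, T, C)
    q, k, v = seq @ w_q, seq @ w_k, seq @ w_v
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(c)  # (H*W, T, T)
    attn = softmax(scores, axis=-1)  # each frame weights all frames
    mixed = (attn @ v) @ w_out  # (H*W, T, C)
    out = seq + mixed  # residual connection around the temporal layer
    return out.transpose(1, 0, 2).reshape(t, h, w, c)


rng = np.random.default_rng(0)
T, H, W, C = 4, 2, 3, 8
feats = rng.standard_normal((T, H, W, C))
w_q, w_k, w_v = (rng.standard_normal((C, C)) * 0.1 for _ in range(3))

# Assumption: a zero output projection makes the layer an identity at the start,
# so the per-frame (image) behaviour is unchanged before any video training.
w_out_zero = np.zeros((C, C))
out_identity = temporal_self_attention(feats, w_q, w_k, w_v, w_out_zero)

# With a non-zero output projection, frames are mixed across time.
w_out = rng.standard_normal((C, C)) * 0.1
out_mixed = temporal_self_attention(feats, w_q, w_k, w_v, w_out)
```

With the zero output projection the layer returns its input unchanged, which is one common way to add temporal layers to a pre-trained image model without disturbing its priors at initialization. Whether RelightVid uses this particular initialization is not stated here.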

Paper Structure

This paper contains 18 sections, 4 equations, 11 figures, 3 tables.

Figures (11)

  • Figure 1: The LightAtlas data pipeline generates high-quality video relighting pairs from both in-the-wild videos and 3D-rendered data.
  • Figure 2: Model design that lifts an image diffusion model to temporally consistent video relighting under a text prompt, a background video, and an HDR video map.
  • Figure 3: Qualitative comparison of text-conditioned video illumination editing. Given a source video and guidance text, we compare RelightVid with other classic text-driven video editing methods, where AnyV2V initially uses IC-Light to modify the first frame.
  • Figure 4: Qualitative comparison of background-conditioned video illumination editing. Given any foreground appearance and a background video reference, we relight videos and compare our method with the per-frame IC-Light (smoothed) method.
  • Figure 5: Synthetic background-conditioned illumination editing results. We use the Hunyuan model for long synthetic background videos with strong dynamic lighting, demonstrating the effectiveness and robustness of our method in scenarios with dynamic lighting and long-video editing.
  • ...and 6 more figures