UniLumos: Fast and Unified Image and Video Relighting with Physics-Plausible Feedback

Ropeway Liu, Hangjie Yuan, Bo Dong, Jiazheng Xing, Jinwang Wang, Rui Zhao, Yan Xing, Weihua Chen, Fan Wang

TL;DR

UniLumos tackles physically plausible image and video relighting under flexible prompts by integrating a flow-matching diffusion backbone with physics-plausible RGB-space feedback. It supervises lighting with depth and surface normals, enabling alignment of illumination with scene geometry, and introduces LumosData with a six-dimensional lighting annotation protocol plus LumosBench for attribute-level evaluation. The method optimizes a joint objective $\mathcal{L} = \lambda_0\mathcal{L}_0 + \lambda_1\mathcal{L}_{\text{fast}} + \lambda_2\mathcal{L}_{\text{phy}}$ (with $\lambda_0=1.0$, $\lambda_1=\lambda_2=0.1$) to balance accuracy, fast inference, and geometric grounding. Experimental results show state-of-the-art relighting quality, improved physical consistency, and about a 20x speedup for both image and video tasks, highlighting practical impact for real-time or large-scale relighting applications.
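The weighted sum above can be sketched in a few lines. This is only an illustration of how the three terms combine under the stated weights; the function and argument names (`joint_objective`, `loss_base`, `loss_fast`, `loss_phy`) are hypothetical and not taken from the UniLumos code base, and the individual loss values are assumed to be precomputed scalars.

```python
def joint_objective(loss_base, loss_fast, loss_phy,
                    lambda_0=1.0, lambda_1=0.1, lambda_2=0.1):
    """Combine three scalar loss terms as L = l0*L_0 + l1*L_fast + l2*L_phy.

    The default weights follow the values quoted in the summary
    (lambda_0 = 1.0, lambda_1 = lambda_2 = 0.1). All names here are
    illustrative assumptions, not the authors' API.
    """
    return lambda_0 * loss_base + lambda_1 * loss_fast + lambda_2 * loss_phy


if __name__ == "__main__":
    # Example scalar values chosen purely for illustration.
    total = joint_objective(loss_base=0.5, loss_fast=0.2, loss_phy=0.3)
    print(f"total loss: {total:.4f}")
```

With these weights the base term dominates, and the fast-inference and physics terms act as smaller regularizers on top of it.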

Abstract

Relighting is a crucial task with both practical demand and artistic value, and recent diffusion models have shown strong potential by enabling rich and controllable lighting effects. However, as they are typically optimized in semantic latent space, where proximity does not guarantee physical correctness in visual space, they often produce unrealistic results, such as overexposed highlights, misaligned shadows, and incorrect occlusions. We address this with UniLumos, a unified relighting framework for both images and videos that brings RGB-space geometry feedback into a flow matching backbone. By supervising the model with depth and normal maps extracted from its outputs, we explicitly align lighting effects with the scene structure, enhancing physical plausibility. Nevertheless, this feedback requires high-quality outputs for supervision in visual space, making standard multi-step denoising computationally expensive. To mitigate this, we employ path consistency learning, allowing supervision to remain effective even under few-step training regimes. To enable fine-grained relighting control and supervision, we design a structured six-dimensional annotation protocol capturing core illumination attributes. Building upon this, we propose LumosBench, a disentangled attribute-level benchmark that evaluates lighting controllability via large vision-language models, enabling automatic and interpretable assessment of relighting precision across individual dimensions. Extensive experiments demonstrate that UniLumos achieves state-of-the-art relighting quality with significantly improved physical consistency, while delivering a 20x speedup for both image and video relighting. Code is available at https://github.com/alibaba-damo-academy/Lumos-Custom.
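To make the idea of geometry feedback concrete, the following is a minimal sketch of one plausible form such a term could take: a mean absolute depth difference plus a mean cosine distance between surface normals. The abstract does not specify the exact loss, so the formula, the equal weighting of the two parts, and all function names are assumptions for illustration. Depth and normal maps are assumed to have been extracted already (by some external estimator) and flattened to per-pixel lists.

```python
import math


def depth_term(pred_depth, ref_depth):
    """Mean absolute difference between two flat lists of per-pixel depths."""
    return sum(abs(p - r) for p, r in zip(pred_depth, ref_depth)) / len(ref_depth)


def normal_term(pred_normals, ref_normals, eps=1e-12):
    """Mean (1 - cosine similarity) between per-pixel 3D normal vectors."""
    total = 0.0
    for p, r in zip(pred_normals, ref_normals):
        dot = sum(a * b for a, b in zip(p, r))
        norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in r))
        # eps guards against division by zero for degenerate normals.
        total += 1.0 - dot / max(norm, eps)
    return total / len(ref_normals)


def geometry_feedback(pred_depth, ref_depth, pred_normals, ref_normals):
    """Hypothetical geometry feedback: depth term plus normal term, equally weighted."""
    return depth_term(pred_depth, ref_depth) + normal_term(pred_normals, ref_normals)


if __name__ == "__main__":
    # Two-pixel toy example: the second pixel disagrees in depth and normal.
    pred_d, ref_d = [1.0, 2.0], [1.0, 3.0]
    pred_n = [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0)]
    ref_n = [(0.0, 0.0, 1.0), (0.0, 0.0, 1.0)]
    print(geometry_feedback(pred_d, ref_d, pred_n, ref_n))
```

The term is zero when the geometry recovered from the relit output matches the reference geometry and grows as depth or normals drift, which is the behavior the abstract describes as aligning lighting effects with scene structure.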

Paper Structure

This paper contains 19 sections, 9 equations, 11 figures, 7 tables, 1 algorithm.

Figures (11)

  • Figure 1: UniLumos performs physically plausible image and video relighting, conditioned on textual prompts and reference videos.
  • Figure 2: The overall pipeline of UniLumos. The left is LumosData, our proposed data construction pipeline, which consists of four stages for generating diverse relighting pairs from real-world sources. The right shows the architecture of UniLumos, a unified framework for image and video relighting, designed to achieve physically plausible illumination control.
  • Figure 3: Qualitative comparison of baseline methods. Each method takes a subject video and a textual illumination description as input, generating the related subject with the corresponding background under the specified lighting condition.
  • Figure 4: UniLumos performs physically plausible video relighting conditioned on different reference videos.
  • Figure 5: Comparison of inference time costs of different methods under the same settings.
  • ...and 6 more figures