Comprehensive Relighting: Generalizable and Consistent Monocular Human Relighting and Harmonization

Junying Wang, Jingyuan Liu, Xin Sun, Krishna Kumar Singh, Zhixin Shu, He Zhang, Jimei Yang, Nanxuan Zhao, Tuanfeng Y. Wang, Simon S. Chen, Ulrich Neumann, Jae Shin Yoon

TL;DR

This work tackles monocular human relighting and background harmonization across arbitrary body parts by leveraging a pre-trained diffusion prior within a coarse-to-fine framework. It introduces an unsupervised temporal lighting model that enforces cycle-consistency across real videos, and a spatio-temporal blending scheme with a guided refinement to maintain temporal coherence and preserve high-frequency details. The method jointly learns relighting and background harmonization conditioned on a coarse SH-based lighting representation $\boldsymbol{\phi}$ and a background image $\mathbf{B}$, enabling control over target lighting and backgrounds with generalization to portraits, half-body, full-body, and multi-person scenes. Experimental results show strong generalization and temporal coherence, outperforming existing image-based relighting and harmonization methods, with a detailed ablation study validating the contributions of the coarse-to-fine design, temporal module, and refinement stage.
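
To make the idea of a coarse SH-based lighting representation concrete, the following is a minimal sketch of how spherical-harmonics coefficients can be turned into a shading value for a surface normal. It is not the paper's implementation: the second-order (9-coefficient), single-channel form and the helper names `sh_basis` and `sh_shading` are assumptions made for illustration, and the summary above does not state which SH order the method uses.

```python
import math


def sh_basis(normal):
    """Evaluate the 9 real spherical-harmonics basis functions (bands 0-2)
    at a surface normal. The normal is normalized to unit length first.
    Assumption: second-order SH; the source does not specify the order."""
    x, y, z = normal
    length = math.sqrt(x * x + y * y + z * z)
    x, y, z = x / length, y / length, z / length
    return [
        0.282095,                        # band 0: constant (ambient) term
        0.488603 * y,                    # band 1: linear terms
        0.488603 * z,
        0.488603 * x,
        1.092548 * x * y,                # band 2: quadratic terms
        1.092548 * y * z,
        0.315392 * (3.0 * z * z - 1.0),
        1.092548 * x * z,
        0.546274 * (x * x - y * y),
    ]


def sh_shading(coeffs, normal):
    """Hypothetical helper: evaluate the SH lighting expansion at a normal
    as the dot product of 9 lighting coefficients with the SH basis."""
    if len(coeffs) != 9:
        raise ValueError("expected 9 second-order SH coefficients")
    return sum(c * b for c, b in zip(coeffs, sh_basis(normal)))


if __name__ == "__main__":
    # Ambient term plus a directional component along +z.
    phi = [1.0, 0.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
    print(sh_shading(phi, (0.0, 0.0, 1.0)))   # normal facing the light
    print(sh_shading(phi, (0.0, 0.0, -1.0)))  # normal facing away
```

With such a representation, changing a handful of coefficients changes the low-frequency lighting direction and intensity, which is why it suits a coarse control signal; in the described method, fine detail is left to the diffusion model and the refinement stage.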

Abstract

This paper introduces Comprehensive Relighting, the first all-in-one approach that can both control and harmonize the lighting from an image or video of humans with arbitrary body parts from any scene. Building such a generalizable model is extremely challenging due to the lack of datasets, which restricts existing image-based relighting models to specific scenarios (e.g., faces or static humans). To address this challenge, we repurpose a pre-trained diffusion model as a general image prior and jointly model human relighting and background harmonization in a coarse-to-fine framework. To further enhance the temporal coherence of the relighting, we introduce an unsupervised temporal lighting model that learns lighting cycle consistency from many real-world videos without any ground truth. At inference time, our temporal lighting module is combined with the diffusion models through spatio-temporal feature blending algorithms without extra training, and we apply a new guided refinement as a post-processing step to preserve the high-frequency details of the input image. In experiments, Comprehensive Relighting shows strong generalizability and temporal lighting coherence, outperforming existing image-based human relighting and harmonization methods.

Paper Structure

This paper contains 26 sections, 11 equations, 25 figures, 6 tables, 1 algorithm.

Figures (25)

  • Figure 1: We introduce Comprehensive Relighting, a generalizable and consistent model for relighting and harmonization, which controls the lighting properties of a single image or video of humans with arbitrary body parts. Given target lighting coefficients, e.g., spherical harmonics (second), background scenes (third), or their combination (fourth), our model performs consistent and harmonized relighting.
  • Figure 2: Comparison of various baseline methods for relighting settings and functionalities.
  • Figure 3: Our model generalizes to various body parts (portrait, half-body, full-body, multi-person) for relighting and harmonization, with lighting control variables shown in the insets.
  • Figure 4: System overview. (a) Given an input image of humans with coarse lighting and a background image, our diffusion model generates relit images harmonized with the background scenes (Sec. \ref{sec2}). (b) The external temporal modules learn temporal cycle consistency from many real-world videos to construct temporal lighting features (Sec. \ref{sec3}). (c) At inference time, we blend the features from the lighting and temporal modules spatially and temporally to enable coherent and generalizable human relighting (Sec. \ref{inference}).
  • Figure 5: Qualitative comparison of synthetic video frames (corresponding to Tab. \ref{tab:table2}). From left to right: composite input with target lighting parameters (inset), our relit result, baseline methods, and normalized L2$\downarrow$ photometric error map (inset).
  • ...and 20 more figures