High-Fidelity Relightable Monocular Portrait Animation with Lighting-Controllable Video Diffusion Model

Mingtao Guo, Guanyu Xing, Yanli Liu

TL;DR

The paper tackles relightable monocular portrait animation by separating intrinsic features (identity and appearance) from extrinsic ones (pose and lighting). It introduces LCVD, a diffusion-based framework that builds two feature subspaces, an extrinsic one (from shading hints) and an intrinsic one (from the reference image), using a shading adapter and a reference adapter, and merges them within a pre-trained image-to-video diffusion model. Training is self-supervised, using DECA-derived shading hints and the dual-adapter scheme, while inference employs Composer-like guidance and multi-condition control to manipulate lighting without sacrificing identity. Extensive evaluations show superior lighting realism, image quality, and video fidelity against state-of-the-art methods, and motion alignment and long-sequence generation strategies address identity leakage and sequence-length limits. LCVD thus offers a practical, high-fidelity approach to relightable portrait animation with controllable lighting and pose and consistent identity across video sequences.

Abstract

Relightable portrait animation aims to animate a static reference portrait to match the head movements and expressions of a driving video while adapting to user-specified or reference lighting conditions. Existing portrait animation methods fail to achieve relightable portraits because they do not separate and manipulate intrinsic (identity and appearance) and extrinsic (pose and lighting) features. In this paper, we present a Lighting Controllable Video Diffusion model (LCVD) for high-fidelity, relightable portrait animation. We address this limitation by distinguishing these feature types through dedicated subspaces within the feature space of a pre-trained image-to-video diffusion model. Specifically, we employ the 3D mesh, pose, and lighting-rendered shading hints of the portrait to represent the extrinsic attributes, while the reference represents the intrinsic attributes. In the training phase, we employ a reference adapter to map the reference into the intrinsic feature subspace and a shading adapter to map the shading hints into the extrinsic feature subspace. By merging features from these subspaces, the model achieves nuanced control over lighting, pose, and expression in generated animations. Extensive evaluations show that LCVD outperforms state-of-the-art methods in lighting realism, image quality, and video consistency, setting a new benchmark in relightable portrait animation.
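
As a rough illustration of what a lighting-rendered shading hint could look like, the sketch below shades per-vertex normals with nine spherical-harmonics lighting coefficients (the figure captions mention spherical harmonics coefficients for the target lighting). The function names `sh_basis` and `render_shading` are hypothetical, the lighting is grayscale, and the basis ordering and constants are the common real-SH convention rather than anything confirmed by the paper or by DECA; a full renderer would also rasterize the posed mesh.

```python
import numpy as np

def sh_basis(normals):
    """Evaluate the 9 real spherical-harmonics basis functions (bands 0-2).

    normals: array of shape (N, 3); rows are normalized inside the function.
    Returns an array of shape (N, 9).
    """
    n = normals / np.linalg.norm(normals, axis=-1, keepdims=True)
    x, y, z = n[:, 0], n[:, 1], n[:, 2]
    return np.stack(
        [
            0.282095 * np.ones_like(x),      # band 0 (constant term)
            0.488603 * y,                    # band 1
            0.488603 * z,
            0.488603 * x,
            1.092548 * x * y,                # band 2
            1.092548 * y * z,
            0.315392 * (3.0 * z**2 - 1.0),
            1.092548 * x * z,
            0.546274 * (x**2 - y**2),
        ],
        axis=-1,
    )

def render_shading(normals, sh_coeffs):
    """Per-normal shading as a weighted sum of SH basis values.

    sh_coeffs: array of shape (9,), assumed grayscale lighting.
    """
    return sh_basis(normals) @ sh_coeffs

# Toy usage: two normals, lighting that depends only on the z direction.
normals = np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]])
coeffs = np.zeros(9)
coeffs[2] = 1.0
shading = render_shading(normals, coeffs)
```

Swapping `coeffs` for a different lighting vector while keeping the normals fixed changes only the shading, which is the property that lets a shading hint carry lighting and pose independently of the reference appearance.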

Paper Structure

This paper contains 21 sections, 6 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Qualitative results of our method. The target lighting is applied to the meshes of the driving frames to generate shading hints. Using the shading hints, our relightable portrait animation framework animates and relights the reference frame, e.g., the results within the solid boxes show lighting consistent with the target lighting and poses consistent with the driving frames.
  • Figure 2: Overview of our pipeline for lighting-controllable portrait animation. It consists of two main stages: (1) Portrait Attributes Subspace Modeling Stage: We use DECA to encode video frames and extract lighting, pose, and shape parameters, which are rendered as shading hints. After processing the shading hints and reference image through the shading adapter and reference adapter, the two features are randomly selected and fused to guide the Stable Video Diffusion Model in generating denoised video frames with consistent lighting, pose, identity, and appearance. (2) Relighting and Animation Stage: We render the shading hints using the pose of the portrait from the video, the shape from the reference image, and the spherical harmonics coefficients of the target lighting. After processing the shading hints and reference image through two adapters, we employ multi-condition classifier-free guidance to adjust the magnitude of the extrinsic feature guidance direction, enabling the generation of lighting-controllable portrait animations.
  • Figure 3: Qualitative comparisons with DPR [zhou2019deep], SMFR [hou2021towards], StyleFlow [10.1145/3447648], NFL [nerffacelighting], and DiFaReli [ponglertnapakorn2023difareli]. The first column shows the input video frames, and the remaining columns present relighted results under various lighting conditions. Our method demonstrates more realistic performance, particularly in challenging cases such as side lighting.
  • Figure 4: Qualitative comparison of portrait relighting with NFL [nerffacelighting], StyleFlow [10.1145/3447648], and DiFaReli [ponglertnapakorn2023difareli] on the FFHQ dataset [karras2019style]. The first column shows the input FFHQ portrait images, and the remaining columns display the relighted results under various lighting conditions. Our method demonstrates more realistic results.
  • Figure 5: Qualitative comparison of cross-identity portrait animation with DaGAN [hong2022depth], StyleHEAT [yin2022styleheat], and AnimateAnyone [hu2024animate] on the HDTF dataset. Our method demonstrates more lifelike results.
  • ...and 3 more figures
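
The Figure 2 caption mentions multi-condition classifier-free guidance that adjusts the magnitude of the extrinsic guidance direction. A minimal sketch of one plausible form is given below; it is an assumption, not the paper's stated equation. It takes two noise predictions from the same denoiser, one conditioned on the reference (intrinsic) features only and one conditioned on both the reference and shading (extrinsic) features, and extrapolates along their difference. The names `guided_noise`, `eps_intrinsic`, `eps_full`, and `scale` are hypothetical.

```python
import numpy as np

def guided_noise(eps_intrinsic, eps_full, scale):
    """Combine two noise predictions along the extrinsic guidance direction.

    eps_intrinsic: prediction conditioned on the reference features only.
    eps_full:      prediction conditioned on reference + shading features.
    scale:         guidance weight; 0 ignores the shading condition,
                   1 returns eps_full, values above 1 strengthen it.
    """
    direction = eps_full - eps_intrinsic  # change attributable to shading
    return eps_intrinsic + scale * direction

# Toy usage with random stand-ins for the two denoiser outputs.
rng = np.random.default_rng(0)
eps_ref_only = rng.standard_normal((4, 8, 8))
eps_ref_and_shading = rng.standard_normal((4, 8, 8))
eps_guided = guided_noise(eps_ref_only, eps_ref_and_shading, scale=2.0)
```

Under this assumed form, raising `scale` pushes the sample further toward the lighting and pose encoded in the shading hints, while the reference-only prediction anchors identity and appearance.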