High-Fidelity Relightable Monocular Portrait Animation with Lighting-Controllable Video Diffusion Model
Mingtao Guo, Guanyu Xing, Yanli Liu
TL;DR
The paper tackles relightable monocular portrait animation by separating intrinsic features (identity and appearance) from extrinsic ones (pose and lighting). It introduces LCVD, a diffusion-based framework that builds two feature subspaces, an extrinsic one from shading hints and an intrinsic one from the reference image, using a shading adapter and a reference adapter, and merges them within a pre-trained image-to-video diffusion model. Training is self-supervised, using DECA-derived shading hints and the dual-adapter scheme, while inference employs Composer-like guidance and multi-condition control to manipulate lighting without sacrificing identity. Extensive evaluations show better lighting realism, image quality, and video fidelity than state-of-the-art methods, and motion alignment and long-sequence generation strategies address identity leakage and sequence-length limits. LCVD thus offers a practical, high-fidelity approach to relightable portrait animation with controllable lighting and pose and consistent identity across video sequences.
Abstract
Relightable portrait animation aims to animate a static reference portrait to match the head movements and expressions of a driving video while adapting to user-specified or reference lighting conditions. Existing portrait animation methods fail to achieve relightable portraits because they do not separate and manipulate intrinsic (identity and appearance) and extrinsic (pose and lighting) features. In this paper, we present a Lighting Controllable Video Diffusion model (LCVD) for high-fidelity, relightable portrait animation. We address this limitation by distinguishing these feature types through dedicated subspaces within the feature space of a pre-trained image-to-video diffusion model. Specifically, we use shading hints rendered from the portrait's 3D mesh, pose, and lighting to represent the extrinsic attributes, while the reference image represents the intrinsic attributes. In the training phase, we employ a reference adapter to map the reference image into the intrinsic feature subspace and a shading adapter to map the shading hints into the extrinsic feature subspace. By merging features from these subspaces, the model achieves nuanced control over lighting, pose, and expression in generated animations. Extensive evaluations show that LCVD outperforms state-of-the-art methods in lighting realism, image quality, and video consistency, setting a new benchmark in relightable portrait animation.
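
The dual-adapter idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the linear adapters, the feature sizes, the element-wise addition used for merging, and every name below (`make_adapter`, `merge_features`, `FEATURE_DIM`) are assumptions made only for illustration. The abstract states that the two feature sets are merged but not how.

```python
import random

FEATURE_DIM = 8  # hypothetical size of the shared feature space


def make_adapter(in_dim, out_dim, seed):
    """Build a toy linear adapter: a fixed random in_dim -> out_dim projection.

    Assumption: the real adapters are learned networks; a random linear map
    is used here only to keep the sketch self-contained.
    """
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
               for _ in range(out_dim)]

    def adapter(x):
        return [sum(w * v for w, v in zip(row, x)) for row in weights]

    return adapter


def merge_features(intrinsic, extrinsic):
    """Merge the two feature vectors by element-wise addition (an assumption)."""
    return [i + e for i, e in zip(intrinsic, extrinsic)]


# Hypothetical adapters, one per condition.
reference_adapter = make_adapter(in_dim=16, out_dim=FEATURE_DIM, seed=0)
shading_adapter = make_adapter(in_dim=12, out_dim=FEATURE_DIM, seed=1)

reference_image = [0.5] * 16  # stand-in for an encoded reference portrait
shading_hint = [0.2] * 12     # stand-in for one frame's shading hint

intrinsic = reference_adapter(reference_image)  # identity and appearance
extrinsic = shading_adapter(shading_hint)       # pose and lighting
merged = merge_features(intrinsic, extrinsic)

# Relighting in this toy setup: keep the intrinsic features and swap the hint.
relit_hint = [0.9] * 12
relit = merge_features(intrinsic, shading_adapter(relit_hint))
```

In this sketch, changing only the shading hint changes the merged features while the intrinsic contribution stays fixed, which mirrors the separation of lighting and pose from identity that the abstract describes.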
