SynergyWarpNet: Attention-Guided Cooperative Warping for Neural Portrait Animation
Shihang Li, Zhiqiang Gong, Minming Ye, Yue Gao, Wen Yao
TL;DR
The paper tackles video-driven portrait animation by combining the complementary strengths of explicit geometric warping and attention-based refinement. It introduces SynergyWarpNet, a three-stage framework: explicit warping with dense optical flow, reference-augmented correction using cross-attention over multiple reference images, and confidence-guided fusion that merges the two streams before decoding with a SPADE-based generator. The approach leverages 3D implicit keypoints, Gaussian encoding, and multi-reference textures to handle pose changes, occlusion, and background regions robustly, achieving state-of-the-art results on VFHQ and HDTF with improved temporal coherence. This hybrid, cooperative architecture offers a practical pathway to high-fidelity, controllable talking-head synthesis suitable for avatars and digital communication systems.
Abstract
Recent advances in neural portrait animation have demonstrated remarkable potential for applications in virtual avatars, telepresence, and digital content creation. However, traditional explicit warping approaches often struggle to transfer motion accurately or to recover missing regions, while recent attention-based warping methods, though effective, frequently suffer from high complexity and weak geometric grounding. To address these issues, we propose SynergyWarpNet, an attention-guided cooperative warping framework designed for high-fidelity talking head synthesis. Given a source portrait, a driving image, and a set of reference images, our model progressively refines the animation in three stages. First, an explicit warping module performs coarse spatial alignment between the source and driving image using 3D dense optical flow. Next, a reference-augmented correction module leverages cross-attention across 3D keypoints and texture features from multiple reference images to semantically complete occluded or distorted regions. Finally, a confidence-guided fusion module integrates the warped and corrected outputs through spatially adaptive fusion, using a learned confidence map to balance structural alignment and visual consistency. Comprehensive evaluations on benchmark datasets demonstrate state-of-the-art performance.
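To make the second and third stages concrete, the following is a minimal sketch for a single spatial location, not the authors' implementation. It assumes scaled dot-product attention for the reference-augmented correction and a sigmoid-gated convex blend, c * warped + (1 - c) * corrected, for the confidence-guided fusion; the abstract does not give either formula. All function names (`cross_attention`, `fuse`) and the toy feature vectors are hypothetical, and the dense-flow warping of the first stage is taken as given.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query, keys, values):
    """Scaled dot-product attention of one query over reference tokens.

    Assumption: the warped feature at a location acts as the query and
    features from the reference images supply the keys and values.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fuse(warped, corrected, confidence_logit):
    """Blend two feature vectors with a confidence c = sigmoid(logit).

    Assumption: high confidence keeps the explicitly warped feature,
    low confidence falls back on the attention-corrected feature.
    """
    c = sigmoid(confidence_logit)
    return [c * w + (1.0 - c) * r for w, r in zip(warped, corrected)]


# Toy example: one location, two reference tokens, 2-D features.
warped = [1.0, 0.0]
ref_keys = [[1.0, 0.0], [0.0, 1.0]]
ref_values = [[0.5, 0.5], [0.0, 2.0]]

corrected = cross_attention(warped, ref_keys, ref_values)
trusted = fuse(warped, corrected, 10.0)    # close to the warped feature
occluded = fuse(warped, corrected, -10.0)  # close to the corrected feature
```

In a full model the confidence would be a per-pixel map predicted by a network rather than a hand-set scalar, so that well-aligned regions rely on the warped stream and occluded or distorted regions rely on the reference-corrected stream.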
