Anchored Diffusion for Video Face Reenactment
Idan Kligvasser, Regev Cohen, George Leifman, Ehud Rivlin, Michael Elad
TL;DR
This work tackles the challenge of generating long, temporally coherent videos under memory constraints by introducing Anchored Diffusion. It builds a temporal diffusion-transformer denoiser, sDiT, trained on non-uniformly spaced frame sequences and guided by global CLIP and per-frame landmark signals. At inference, an anchoring mechanism generates multiple sequences that all share a common anchor frame, which keeps them consistent across time. The approach is demonstrated on neural face reenactment, supported by a large-scale dataset (ReenactFaces-1M, with over 1M clips from 53k identities), and extends to text-to-video generation and semantic editing via CLIP guidance. Results show longer, higher-quality, and more consistent videos than prior methods, with practical relevance for video synthesis, editing, and applications beyond face reenactment.
Abstract
Video generation has drawn significant interest recently, pushing the development of large-scale models capable of producing realistic videos with coherent motion. Due to memory constraints, these models typically generate short video segments that are then combined into long videos. The merging process poses a significant challenge, as it requires ensuring smooth transitions and overall consistency. In this paper, we introduce Anchored Diffusion, a novel method for synthesizing relatively long and seamless videos. We extend Diffusion Transformers (DiTs) to incorporate temporal information, creating our sequence-DiT (sDiT) model for generating short video segments. Unlike previous works, we train our model on video sequences with random non-uniform temporal spacing and incorporate temporal information via external guidance, increasing flexibility and allowing it to capture both short- and long-term relationships. Furthermore, during inference, we leverage the transformer architecture to modify the diffusion process, generating a batch of non-uniform sequences anchored to a common frame, ensuring consistency regardless of temporal distance. To demonstrate our method, we focus on face reenactment, the task of creating a video from a source image that replicates the facial expressions and movements from a driving video. Through comprehensive experiments, we show that our approach outperforms current techniques in producing longer, consistent, high-quality videos while offering editing capabilities.
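To make the idea of "a batch of non-uniform sequences anchored to a common frame" concrete, the following minimal sketch shows only the frame-index bookkeeping, not the diffusion model. The function name `anchored_index_batches`, its parameters, and the random partitioning scheme are assumptions made for illustration; the abstract does not specify how frames are assigned to sequences.

```python
import random


def anchored_index_batches(num_frames, seq_len, anchor=0, seed=0):
    """Split frame indices into sequences that all contain one anchor frame.

    Hypothetical helper (not from the paper): each returned sequence holds
    `seq_len` sorted indices, namely the anchor plus `seq_len - 1` other
    frames drawn at random, so the temporal spacing is non-uniform.
    """
    if not 2 <= seq_len <= num_frames:
        raise ValueError("seq_len must be between 2 and num_frames")
    if not 0 <= anchor < num_frames:
        raise ValueError("anchor must be a valid frame index")

    rng = random.Random(seed)
    others = [i for i in range(num_frames) if i != anchor]
    rng.shuffle(others)

    per_seq = seq_len - 1  # slots left after reserving one for the anchor
    batches = []
    for start in range(0, len(others), per_seq):
        chunk = others[start:start + per_seq]
        # Pad a short final chunk with frames already used elsewhere,
        # so that every sequence has the same length.
        while len(chunk) < per_seq:
            chunk.append(rng.choice([i for i in others if i not in chunk]))
        batches.append(sorted(chunk + [anchor]))
    return batches


if __name__ == "__main__":
    for seq in anchored_index_batches(num_frames=17, seq_len=5, anchor=8):
        print(seq)
```

Under these assumptions, every sequence in the batch shares the anchor frame, and the union of all sequences covers the whole video. A denoiser applied to the batch could then, in principle, tie the shared frame together across sequences, which is the role the abstract assigns to the anchor.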
