Anchored Diffusion for Video Face Reenactment

Idan Kligvasser, Regev Cohen, George Leifman, Ehud Rivlin, Michael Elad

TL;DR

This work tackles the challenge of generating long, temporally coherent videos under memory constraints by introducing Anchored Diffusion. It builds a temporal diffusion-transformer denoiser, sDiT, trained on non-uniform frame sequences and guided by global CLIP and per-frame landmark signals. At inference, an anchoring mechanism aligns multiple sequences to a shared anchor frame to keep them consistent across time. The approach is demonstrated on neural face reenactment, supported by a large-scale dataset (ReenactFaces-1M, with over 1M clips from 53k identities), and extends to text-to-video generation and semantic editing via CLIP guidance. Results show longer, higher-quality, and more consistent videos than prior methods, with potential applications in video synthesis and editing beyond face reenactment.

Abstract

Video generation has drawn significant interest recently, pushing the development of large-scale models capable of producing realistic videos with coherent motion. Due to memory constraints, these models typically generate short video segments that are then combined into long videos. The merging process poses a significant challenge, as it requires ensuring smooth transitions and overall consistency. In this paper, we introduce Anchored Diffusion, a novel method for synthesizing relatively long and seamless videos. We extend Diffusion Transformers (DiTs) to incorporate temporal information, creating our sequence-DiT (sDiT) model for generating short video segments. Unlike previous works, we train our model on video sequences with random non-uniform temporal spacing and incorporate temporal information via external guidance, increasing flexibility and allowing it to capture both short and long-term relationships. Furthermore, during inference, we leverage the transformer architecture to modify the diffusion process, generating a batch of non-uniform sequences anchored to a common frame, ensuring consistency regardless of temporal distance. To demonstrate our method, we focus on face reenactment, the task of creating a video from a source image that replicates the facial expressions and movements from a driving video. Through comprehensive experiments, we show our approach outperforms current techniques in producing longer consistent high-quality videos while offering editing capabilities.
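The anchoring idea in the abstract (a batch of non-uniform sequences that all contain one common frame) can be illustrated with a small index-partitioning sketch. This is only an illustration under stated assumptions, not the authors' implementation: the function name `anchored_groups`, the random partition of the non-anchor frames, and the example sizes (13 frames, 4 frames per sequence) are hypothetical choices, and the diffusion model itself is not shown.

```python
import random


def anchored_groups(num_frames, seq_len, anchor=0, seed=0):
    """Split frame indices into non-uniform sequences that share one anchor.

    Hypothetical helper for illustration: every returned group holds the
    anchor index plus up to (seq_len - 1) other indices, drawn at random so
    the temporal spacing inside a group is non-uniform.
    """
    if seq_len < 2:
        raise ValueError("seq_len must be at least 2 (anchor plus one frame)")
    if not 0 <= anchor < num_frames:
        raise ValueError("anchor must be a valid frame index")

    rng = random.Random(seed)
    others = [i for i in range(num_frames) if i != anchor]
    rng.shuffle(others)  # random assignment gives non-uniform spacing

    step = seq_len - 1  # slots left in each sequence after the anchor
    groups = []
    for start in range(0, len(others), step):
        chunk = others[start:start + step]
        groups.append(sorted([anchor] + chunk))
    return groups


if __name__ == "__main__":
    # Assumed sizes: 13 frames in total, 4 frames per generated sequence.
    for group in anchored_groups(num_frames=13, seq_len=4):
        print(group)
```

Each printed group would be denoised as one sequence in the batch. Because the anchor index appears in every group, all sequences can be tied to the same frame no matter how far apart their other frames are in time.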

Paper Structure (24 sections, 3 equations, 20 figures, 2 tables, 1 algorithm)

Figures (20)

  • Figure 1: Sample results generated by Anchored Diffusion for face reenactment given a driving video (top row), including image-to-video generation (second row), text-to-video generation (third row), and video editing (bottom row).
  • Figure 2: Scheme Overview. Left: Our video generation pipeline operates in latent space, where the sDiT denoiser is trained with per-frame guidance from CLIP embeddings and facial landmarks, using a weighted mean-square error loss to optimize the recovery of the driving video. Right: Our Sequence DiT (sDiT) architecture extends the DiT model for image generation to video generation by incorporating temporal dimensions and temporal positional encoding.
  • Figure 3: Anchored Diffusion. We illustrate our strategy for merging multiple generated sequences into long videos, highlighting the main difference from a recent approach used in previous works. (a) Multidiffusion [bar2024lumiere, bar2023multidiffusion] generates multiple uniform sequences with overlapping windows of adjacent anchor frames, achieving temporal consistency through averaging. (b) In contrast, our framework samples non-uniform sequences, with consistency between groups maintained by aligning all frames to a single frame shared across all groups.
  • Figure 4: Qualitative Consistency Comparison. We use the sDiT-XL model, which generates $4$ frames at once, to create a $12$-frame video. Multidiffusion fails to maintain consistency, as evident from the person's outfit changing across the video. In contrast, our anchored diffusion remains consistent throughout the video.
  • Figure 5: Consistency Evaluation. Comparing our approach to Multidiffusion for generating long videos. We generated 50 self-reenactment videos per method and measured the average self cosine similarity (Self-CSIM), described in the metrics subsection, between the generated and the driving video embeddings. Our method demonstrates superior consistency (lower values), with the margin further increasing as video length grows.
  • ...and 15 more figures
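
The Figure 5 caption compares generated and driving video embeddings by cosine similarity. A minimal sketch of that kind of measurement follows, assuming each frame is already represented by an embedding vector. The exact Self-CSIM definition lives in the paper's metrics subsection and may differ from this; the function names and the plain-list representation are hypothetical, and the network that produces the embeddings is not shown.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def mean_frame_cosine(generated, driving):
    """Average per-frame cosine similarity between two embedding sequences.

    Hypothetical helper: `generated` and `driving` are lists of per-frame
    embedding vectors of equal length, paired frame by frame.
    """
    if not generated or len(generated) != len(driving):
        raise ValueError("sequences must be non-empty and equally long")
    scores = [cosine_similarity(g, d) for g, d in zip(generated, driving)]
    return sum(scores) / len(scores)
```

Averaging this score over a set of videos, as the caption does over 50 self-reenactment videos per method, gives one number per method and video length.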