IP-FaceDiff: Identity-Preserving Facial Video Editing with Diffusion

Tharun Anand, Aryan Garg, Kaushik Mitra

TL;DR

This work proposes a novel facial video editing framework that leverages the rich latent space of pre-trained text-to-image (T2I) diffusion models, fine-tunes them specifically for facial video editing tasks, and reduces editing time by 80% while maintaining temporal consistency throughout the video sequence.

Abstract

Facial video editing has become increasingly important for content creators, enabling the manipulation of facial expressions and attributes. However, existing models encounter challenges such as poor editing quality, high computational costs, and difficulties in preserving facial identity across diverse edits. Additionally, these models are often constrained to editing predefined facial attributes, limiting their flexibility with diverse editing prompts. To address these challenges, we propose a novel facial video editing framework that leverages the rich latent space of pre-trained text-to-image (T2I) diffusion models and fine-tunes them specifically for facial video editing tasks. Our approach introduces a targeted fine-tuning scheme that enables high-quality, localized, text-driven edits while ensuring identity preservation across video frames. Additionally, by using pre-trained T2I models during inference, our approach reduces editing time by 80% while maintaining temporal consistency throughout the video sequence. We evaluate the effectiveness of our approach through extensive testing across a wide range of challenging scenarios, including varying head poses, complex action sequences, and diverse facial expressions. Our method consistently outperforms existing techniques across a broad set of metrics and benchmarks.

Paper Structure

This paper contains 23 sections, 10 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Model Architecture. Left: Pre-trained T2I models $\epsilon_1$ and $\epsilon_2$ are fine-tuned independently with ArcFace loss and directional CLIP loss for identity-preservation and prompt-adhering localization, respectively. Right: Video frames are inverted with DDIM and then processed through $\epsilon_1$ to extract self-attention features at each timestep. Text-guided editing is applied to keyframes using $\epsilon_2$, guided by identity features from $\epsilon_1$. Edits are propagated to remaining frames using a nearest-neighbor search within the latent space.
  • Figure 2: Strong Prompt-Adhering Multiple Editing. Left: Beyond local edits, our method manipulates facial expressions and age. Right: Our facial video editing method handles simultaneous edits to local facial features.
  • Figure 3: Editing Faces in the Wild. We successfully overcome a previous hurdle of out-of-domain adaptation for facial video editing methods.
  • Figure 4: Ablation. Effect of fine-tuning $\epsilon_1$ and $\epsilon_2$ with ArcFace loss and directional CLIP loss, respectively, on identity preservation and localized editing in facial videos.
  • Figure 5: More Identity-Preserving Localized Editing in the Wild. Left: Hair color change and accessory addition are applied to a randomly scraped online music video. Right: Facial hair and facial expressions of a consenting volunteer are edited using our method.
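
The Figure 1 caption names three ingredients: an ArcFace-based identity loss, a directional CLIP loss, and nearest-neighbor propagation of keyframe edits. The sketch below illustrates one common form of each on plain Python lists. It is a toy under stated assumptions, not the paper's implementation. The embeddings are assumed to come from external ArcFace and CLIP encoders that are not shown, the losses use the widely used "1 minus cosine similarity" form, which may differ from the paper's exact equations, and all function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Small epsilon avoids division by zero for all-zero vectors.
    return dot / (norm_u * norm_v + 1e-12)


def identity_loss(src_face_emb, edit_face_emb):
    """Assumed identity loss: 1 - cosine similarity of face embeddings
    (e.g. ArcFace features of the source and edited frames)."""
    return 1.0 - cosine(src_face_emb, edit_face_emb)


def directional_clip_loss(img_src, img_edit, txt_src, txt_tgt):
    """Assumed directional CLIP loss: the change in image embedding
    should point the same way as the change in text embedding."""
    delta_img = [b - a for a, b in zip(img_src, img_edit)]
    delta_txt = [b - a for a, b in zip(txt_src, txt_tgt)]
    return 1.0 - cosine(delta_img, delta_txt)


def nearest_keyframe(feature, keyframe_features):
    """Index of the keyframe whose feature is most similar to `feature`."""
    return max(
        range(len(keyframe_features)),
        key=lambda i: cosine(feature, keyframe_features[i]),
    )


def propagate(frame_features, keyframe_features, keyframe_edits):
    """Give every frame the edit of its nearest keyframe (toy version:
    one feature vector per frame rather than per latent token)."""
    return [
        keyframe_edits[nearest_keyframe(f, keyframe_features)]
        for f in frame_features
    ]
```

In this toy setting, `propagate([[1.0, 0.1], [0.1, 1.0]], [[1.0, 0.0], [0.0, 1.0]], ["A", "B"])` assigns edit "A" to the first frame and "B" to the second, because each frame's feature is closest to the corresponding keyframe.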