Identity as Presence: Towards Appearance and Voice Personalized Joint Audio-Video Generation

Yingjie Chen, Shilun Lin, Cai Xing, Qixin Yan, Wenjing Wang, Dingming Liu, Hao Liu, Chen Li, Jing Lyu

Abstract

Recent advances have demonstrated compelling capabilities in synthesizing real individuals into generated videos, reflecting the growing demand for identity-aware content creation. Nevertheless, an openly accessible framework enabling fine-grained control over facial appearance and voice timbre across multiple identities remains unavailable. In this work, we present a unified and scalable framework for identity-aware joint audio-video generation, enabling high-fidelity and consistent personalization. Specifically, we introduce a data curation pipeline that automatically extracts identity-bearing information with paired annotations across audio and visual modalities, covering diverse scenarios from single-subject to multi-subject interactions. We further propose a flexible and scalable identity injection mechanism for single- and multi-subject scenarios, in which both facial appearance and vocal timbre act as identity-bearing control signals. Moreover, in light of modality disparity, we design a multi-stage training strategy to accelerate convergence and enforce cross-modal coherence. Experiments demonstrate the superiority of the proposed framework. For more details and qualitative results, please refer to our webpage: Identity-as-Presence (https://chen-yingjie.github.io/projects/Identity-as-Presence).

Paper Structure (14 sections, 6 equations, 7 figures, 3 tables)

Figures (7)

  • Figure 1: Identity-as-Presence supports single-subject and multi-subject personalized joint audio-video generation for facial appearance and vocal timbre.
  • Figure 2: Overview of the data curation pipeline for constructing identity-labeled audio-visual data from raw videos. The process isolates visual and auditory identity-specific signals from raw videos, synthesizes comprehensive captions via MLLMs, and matches audio-visual identities across video clips to ensure high-fidelity identity consistency.
  • Figure 3: The overall dual-tower DiT network architecture and training framework. The model has five inputs: video, audio, video identity, audio identity, and structured caption. We first extract latents for each modality, apply identity embedding to identity latents, then organize latents with structured position embedding. In DiT, we use asymmetric self-attention for decoupled parameterization. Training comprises three stages: stage 1 for unimodal identity training, stage 2 for joint multimodal identity training, and stage 3 for multi-view identity fine-tuning.
  • Figure 4: Comparison with state-of-the-art identity-aware video generation models and joint audio-video generation models.
  • Figure 5: Ablation study on subject anchors and identity embeddings.
  • ...and 2 more figures