UniTalking: A Unified Audio-Video Framework for Talking Portrait Generation

Hebeizi Li, Zihao Liang, Benyuan Sun, Zihao Yin, Xiao Sha, Chenliang Wang, Yi Yang

TL;DR

UniTalking, a unified, end-to-end diffusion framework for generating high-fidelity speech and lip-synchronized video, employs Multi-Modal Transformer Blocks to explicitly model the fine-grained temporal correspondence between audio and video latent tokens via a shared self-attention mechanism.

Abstract

While state-of-the-art audio-video generation models like Veo3 and Sora2 demonstrate remarkable capabilities, their closed-source nature makes their architectures and training paradigms inaccessible. To bridge this gap in accessibility and performance, we introduce UniTalking, a unified, end-to-end diffusion framework for generating high-fidelity speech and lip-synchronized video. At its core, our framework employs Multi-Modal Transformer Blocks to explicitly model the fine-grained temporal correspondence between audio and video latent tokens via a shared self-attention mechanism. By leveraging powerful priors from a pre-trained video generation model, our framework ensures state-of-the-art visual fidelity while enabling efficient training. Furthermore, UniTalking incorporates a personalized voice cloning capability, allowing the generation of speech in a target style from a brief audio reference. Qualitative and quantitative results demonstrate that our method produces highly realistic talking portraits, achieving superior performance over existing open-source approaches in lip-sync accuracy, audio naturalness, and overall perceptual quality.
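The shared self-attention idea described above can be illustrated with a small sketch. This is not the authors' implementation: it assumes a single attention head, one set of projection matrices shared by both modalities, and toy random inputs. The names `joint_attention`, `matmul`, `softmax`, and `rand_matrix` are hypothetical helpers introduced only for this example. The point it shows is that once video and audio tokens are concatenated along the sequence axis, every token's attention row spans both modalities, which is what lets the model relate audio frames to video frames.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def joint_attention(video_tokens, audio_tokens, w_q, w_k, w_v):
    """Single-head self-attention over the concatenated video+audio sequence.

    Assumption: both modalities share one set of projections. The paper's
    dual-stream block may use separate per-modality weights.
    """
    tokens = video_tokens + audio_tokens  # concatenate along the sequence axis
    q, k, v = matmul(tokens, w_q), matmul(tokens, w_k), matmul(tokens, w_v)
    scale = 1.0 / math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]  # transpose keys
    scores = [[s * scale for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]  # each row covers both modalities
    out = matmul(weights, v)
    n_video = len(video_tokens)
    # Split the output back into the video stream and the audio stream.
    return out[:n_video], out[n_video:], weights


def rand_matrix(rows, cols):
    """Toy random matrix standing in for learned weights or latent tokens."""
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
dim = 4
video = rand_matrix(6, dim)  # 6 toy video latent tokens
audio = rand_matrix(3, dim)  # 3 toy audio latent tokens
w_q, w_k, w_v = (rand_matrix(dim, dim) for _ in range(3))
video_out, audio_out, weights = joint_attention(video, audio, w_q, w_k, w_v)
```

In this sketch, the slice `weights[i][6:]` for a video token `i` holds its video-to-audio attention, and `weights[6 + j][:6]` for an audio token `j` holds its audio-to-video attention.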

Paper Structure (20 sections, 2 equations, 6 figures, 3 tables)

Figures (6)

  • Figure 1: Illustration of our unified audio-video framework for talking portrait generation. UniTalking facilitates the generation of synchronized audio and video through multiple input modalities.
  • Figure 2: The data processing pipeline for curating our human-centric audio-video dataset. The pipeline employs sequential filtering, beginning with single-modality (video and audio) checks, followed by a cross-modal filtering stage. The final, high-quality audio-video pairs are then annotated with multi-level, multi-modal captions.
  • Figure 3: The architecture of UniTalking. Our framework jointly generates synchronized audio and video for talking portraits using a dual-stream Multi-Modal DiT (MM-DiT) backbone, which is trained as a Continuous Normalizing Flow via Flow Matching. The core of the model is the Multi-Modal Transformer Block, detailed on the right. It employs Joint Attention on concatenated audio-video tokens to ensure precise temporal alignment, and Cross Attention to incorporate multi-modal conditions such as text and acoustic style. Frozen components are marked with a snowflake, while trainable parts are marked with a flame.
  • Figure 4: Visualization of video frames and Mel-spectrograms for generated audio-visual talking portraits and ground-truth audio-visual data. At the top is the corresponding text prompt, with certain keywords highlighted in bold and red. The generated video and audio are semantically consistent with the ground truth.
  • Figure 5: Visualization of attention maps between video tokens and audio tokens in Joint Attention. Subfigures (a) and (b) show two different random samples. In each subfigure, the first row shows audio-to-video attention and the second row shows video-to-audio attention. The columns correspond to weights taken from different inference sampling steps and different transformer blocks.
  • ...and 1 more figure