UniTalking: A Unified Audio-Video Framework for Talking Portrait Generation

Hebeizi Li; Zihao Liang; Benyuan Sun; Zihao Yin; Xiao Sha; Chenliang Wang; Yi Yang

TL;DR

UniTalking, a unified, end-to-end diffusion framework for generating high-fidelity speech and lip-synchronized video, employs Multi-Modal Transformer Blocks to explicitly model the fine-grained temporal correspondence between audio and video latent tokens via a shared self-attention mechanism.

Abstract

While state-of-the-art audio-video generation models like Veo3 and Sora2 demonstrate remarkable capabilities, their closed-source nature makes their architectures and training paradigms inaccessible. To bridge this gap in accessibility and performance, we introduce UniTalking, a unified, end-to-end diffusion framework for generating high-fidelity speech and lip-synchronized video. At its core, our framework employs Multi-Modal Transformer Blocks to explicitly model the fine-grained temporal correspondence between audio and video latent tokens via a shared self-attention mechanism. By leveraging powerful priors from a pre-trained video generation model, our framework ensures state-of-the-art visual fidelity while enabling efficient training. Furthermore, UniTalking incorporates a personalized voice cloning capability, allowing the generation of speech in a target style from a brief audio reference. Qualitative and quantitative results demonstrate that our method produces highly realistic talking portraits, achieving superior performance over existing open-source approaches in lip-sync accuracy, audio naturalness, and overall perceptual quality.
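The shared self-attention described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the single-head layout, the random per-modality projection weights, the token counts, and the function name `joint_attention` are all assumptions made for illustration. The only idea taken from the text is that audio and video latent tokens are concatenated into one sequence, so every token can attend to tokens of both modalities.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def joint_attention(video_tokens, audio_tokens, rng):
    """Single-head self-attention over concatenated video and audio tokens.

    video_tokens: (n_video, d) array, audio_tokens: (n_audio, d) array.
    The per-modality projections are random stand-ins (an assumption);
    a trained model would learn them.
    """
    d = video_tokens.shape[-1]
    names = ("q", "k", "v")
    w_video = {n: rng.standard_normal((d, d)) / np.sqrt(d) for n in names}
    w_audio = {n: rng.standard_normal((d, d)) / np.sqrt(d) for n in names}

    # Project each modality separately, then concatenate along the token axis.
    q, k, v = (
        np.concatenate([video_tokens @ w_video[n], audio_tokens @ w_audio[n]], axis=0)
        for n in names
    )

    # One attention matrix covers video-video, video-audio,
    # audio-video, and audio-audio interactions.
    weights = softmax(q @ k.T / np.sqrt(d), axis=-1)
    out = weights @ v

    # Split the joint output back into the two streams.
    n_video = video_tokens.shape[0]
    return out[:n_video], out[n_video:], weights


rng = np.random.default_rng(0)
video = rng.standard_normal((6, 8))  # 6 hypothetical video latent tokens
audio = rng.standard_normal((4, 8))  # 4 hypothetical audio latent tokens
video_out, audio_out, attn = joint_attention(video, audio, rng)
print(video_out.shape, audio_out.shape, attn.shape)  # (6, 8) (4, 8) (10, 10)
```

In this sketch, the off-diagonal blocks `attn[:6, 6:]` and `attn[6:, :6]` hold the video-to-audio and audio-to-video weights, which is where cross-modal temporal correspondence would show up.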

Paper Structure (20 sections, 2 equations, 6 figures, 3 tables)

Introduction
Related Work
Audio-to-Video Portrait Generation
Video-to-Audio Generation
Unified Audio-Video Generation
Data Preparation
Data Processing Pipeline
Reference Data Generation
Method
Preliminaries
Overall Architecture
Latent Representation
Multi-Modal Transformer Block
Training Strategy
Experiments
...and 5 more sections

Figures (6)

Figure 1: Illustration of our unified audio-video framework for talking portrait generation. UniTalking facilitates the generation of synchronized audio and video through multiple input modalities.
Figure 2: The data processing pipeline for curating our human-centric audio-video dataset. The pipeline employs sequential filtering, beginning with single-modality (video and audio) checks, followed by a cross-modal filtering stage. The final, high-quality audio-video pairs are then annotated with multi-level, multi-modal captions.
Figure 3: The architecture of UniTalking. Our framework jointly generates synchronized audio and video for talking portraits using a dual-stream Multi-Modal DiT (MM-DiT) backbone, which is trained as a Continuous Normalizing Flow via Flow Matching. The core of the model is the Multi-Modal Transformer Block, detailed on the right. It employs Joint Attention on concatenated audio-video tokens to ensure precise temporal alignment, and Cross Attention to incorporate multi-modal conditions such as text and acoustic style. Frozen components are marked with a snowflake, while trainable parts are marked with a flame.
Figure 4: Visualization of video frames and Mel-spectrograms for generated audio-visual talking portraits and ground-truth audio-visual data. At the top is the corresponding text prompt, with certain keywords highlighted in bold and red. The generated video and audio are semantically consistent with the ground truth.
Figure 5: Visualization of attention maps between video tokens and audio tokens in Joint Attention. Subfigures (a) and (b) show two different random samples. In each subfigure, the first row shows audio-to-video attention and the second row shows video-to-audio attention. Columns correspond to attention weights taken from different inference sampling steps and different transformer blocks.
...and 1 more figure
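
The caption of Figure 3 states that the backbone is trained as a Continuous Normalizing Flow via Flow Matching. The paper's own equations are not reproduced in this listing, so the block below gives only the commonly used linear-path form of the flow matching objective as a reference point. The symbols are assumptions: $x_0$ is Gaussian noise, $x_1$ is a clean (audio or video) latent, $c$ stands for the conditioning inputs, and $v_\theta$ is the velocity predicted by the network. The exact parameterization used by UniTalking may differ.

```latex
\begin{aligned}
x_t &= (1 - t)\,x_0 + t\,x_1, \qquad x_0 \sim \mathcal{N}(0, I),\quad t \sim \mathcal{U}[0, 1], \\
\mathcal{L}_{\mathrm{FM}}(\theta) &= \mathbb{E}_{t,\,x_0,\,x_1}
  \left\| v_\theta(x_t, t, c) - (x_1 - x_0) \right\|_2^2 .
\end{aligned}
```

Under this form the regression target $x_1 - x_0$ is the constant velocity along the straight line from noise to data, and sampling integrates $\mathrm{d}x_t/\mathrm{d}t = v_\theta(x_t, t, c)$ from $t = 0$ to $t = 1$.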
