MoCha: Towards Movie-Grade Talking Character Synthesis

Cong Wei, Bo Sun, Haoyu Ma, Ji Hou, Felix Juefei-Xu, Zecheng He, Xiaoliang Dai, Luxin Zhang, Kunpeng Li, Tingbo Hou, Animesh Sinha, Peter Vajda, Wenhu Chen

TL;DR

The paper defines Talking Characters and introduces MoCha, an end-to-end diffusion-transformer that generates full-body talking character videos from speech and text input. Key innovations include a speech-video window attention mechanism for precise lip-sync, a joint training strategy leveraging both speech-labeled and text-labeled video data, and a structured prompt design enabling multi-character conversations. MoCha achieves state-of-the-art performance on MoCha-Bench with both automatic metrics and human judgments across lip-sync, expression, action naturalness, text alignment, and visual quality, and demonstrates robust generalization and cinematic coherence. This work advances controllable, narrative-driven AI video synthesis for multi-character scenes and has broad implications for automated film production and animation.

Abstract

Recent advancements in video generation have achieved impressive motion realism, yet they often overlook character-driven storytelling, a crucial task for automated film and animation generation. We introduce Talking Characters, a more realistic task to generate talking character animations directly from speech and text. Unlike talking-head generation, Talking Characters aims at generating the full portrait of one or more characters beyond the facial region. In this paper, we propose MoCha, the first of its kind to generate talking characters. To ensure precise synchronization between video and speech, we propose a speech-video window attention mechanism that effectively aligns speech and video tokens. To address the scarcity of large-scale speech-labeled video datasets, we introduce a joint training strategy that leverages both speech-labeled and text-labeled video data, significantly improving generalization across diverse character actions. We also design structured prompt templates with character tags, enabling, for the first time, multi-character conversation with turn-based dialogue, allowing AI-generated characters to engage in context-aware conversations with cinematic coherence. Extensive qualitative and quantitative evaluations, including human preference studies and benchmark comparisons, demonstrate that MoCha sets a new standard for AI-generated cinematic storytelling, achieving superior realism, expressiveness, controllability, and generalization.
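
To make the speech-video window attention idea more concrete, here is a minimal sketch of a local attention mask in which each video token may attend only to audio tokens near its own position in time. This is an illustration, not the paper's implementation: the function name `window_mask`, the `radius` parameter, the symmetric window, and the linear mapping from video index to audio index are all assumptions, and video tokens are treated as a purely temporal sequence.

```python
def window_mask(num_video_tokens, num_audio_tokens, radius):
    """Return mask[v][a], True when video token v may attend to audio token a.

    Assumption: tokens on both streams are evenly spaced in time, so a video
    index maps linearly onto an audio index. `radius` is a hypothetical knob.
    """
    mask = []
    for v in range(num_video_tokens):
        # Time-aligned audio position for this video token (assumed linear map).
        center = (v * num_audio_tokens) // num_video_tokens
        lo = max(0, center - radius)
        hi = min(num_audio_tokens - 1, center + radius)
        # Allow attention only inside the local window [lo, hi].
        mask.append([lo <= a <= hi for a in range(num_audio_tokens)])
    return mask


if __name__ == "__main__":
    # 4 video tokens, 8 audio tokens, window of +/- 1 audio token.
    for row in window_mask(4, 8, 1):
        print("".join("x" if allowed else "." for allowed in row))
```

A mask of this shape would typically be passed to a cross-attention layer so that disallowed audio positions receive zero attention weight. Restricting each video token to nearby audio is what ties mouth motion to the sound at that moment rather than to the whole utterance.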

Paper Structure

This paper contains 19 sections, 4 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: MoCha Architecture. MoCha is an end-to-end Diffusion Transformer model that generates video frames from the joint conditioning of speech and text, without relying on any auxiliary signals. Both speech and text inputs are projected into token representations and aligned with video tokens through cross-attention.
  • Figure 2: MoCha's Speech-Video Window Cross Attention. MoCha generates all video frames in parallel using a window cross-attention mechanism, where each video token attends to a local window of audio tokens to improve alignment and lip-sync quality.
  • Figure 3: Multi-character Conversation and Character Tagging. MoCha supports generating multi-character conversations with scene cuts. We design a specialized prompt template: it first specifies the number of clips, then introduces the characters along with their descriptions and associated tags. Each clip is subsequently described using only the character tags, simplifying the prompt while preserving clarity. MoCha leverages self-attention across video tokens to ensure character and environment consistency. The audio conditioning signal implicitly guides the model on when to transition between clips.
  • Figure 4: Qualitative results of MoCha on MoCha-Bench. MoCha not only generates lip movements that are well-synchronized with the input speech, but also produces natural facial expressions that reflect the prompt, along with realistic hand gestures and action movements.
  • Figure 5: Multi-Stage Training Strategy for MoCha. Text-Speech Joint training starts with close-up shots where speech conditioning has the strongest influence. At each stage, previous data is reduced by 50%, and harder tasks with weaker speech conditioning are introduced. Stage 0 uses text-only video data to establish a foundation for the future stages.
  • ...and 2 more figures
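
The data schedule in the Figure 5 caption can be illustrated with a short sketch. It is one possible reading of "previous data is reduced by 50%", not the authors' recipe: here the share of every earlier category is halved at each new stage and the newly introduced category takes the remainder. The function name `stage_mixture` and the category labels other than "close-up" are hypothetical, and Stage 0 (text-only video data) is not modeled.

```python
def stage_mixture(categories):
    """Yield (new_category, shares) per stage under the assumed halving rule.

    Assumption: "reduced by 50%" means each earlier category's share of the
    training mix is halved when a new, harder category is introduced.
    """
    shares = {}
    for name in categories:
        # Halve the share of everything introduced in earlier stages.
        shares = {k: v / 2 for k, v in shares.items()}
        # The newly introduced category fills the remaining share.
        shares[name] = 1.0 - sum(shares.values())
        yield name, dict(shares)


if __name__ == "__main__":
    # "close-up" comes from the caption; the other labels are placeholders.
    for new_category, mix in stage_mixture(["close-up", "harder-1", "harder-2"]):
        print(new_category, mix)
```

Under this reading the earliest, most strongly speech-conditioned data never disappears from the mix; it only shrinks as harder categories arrive.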