Input-Aware Sparse Attention for Real-Time Co-Speech Video Generation

Beijia Lu, Ziyi Chen, Jing Xiao, Jun-Yan Zhu

TL;DR

The paper tackles the challenge of real-time co-speech video generation by introducing input-aware sparse attention guided by input pose and a region-focused distillation loss. By distilling a slow teacher diffusion model into a fast student and leveraging pose-conditioned attention, the approach achieves real-time synthesis with improved lip synchronization and hand motion realism. Extensive experiments on TalkShow and a YouTube Talking Video dataset show about a 3× speedup over baselines and notable gains in perceptual metrics, while ablations validate the contribution of both attention and distillation components. The method enables scalable, high-quality co-speech avatars, with practical potential for virtual agents and telepresence, though it acknowledges limitations with dynamic backgrounds and finger-level details and discusses ethical considerations for deployment.

Abstract

Diffusion models can synthesize realistic co-speech video from audio for various applications, such as video creation and virtual agents. However, existing diffusion-based methods are slow due to numerous denoising steps and costly attention mechanisms, preventing real-time deployment. In this work, we distill a many-step diffusion video model into a few-step student model. Unfortunately, directly applying recent diffusion distillation methods degrades video quality and falls short of real-time performance. To address these issues, our new video distillation method leverages input human pose conditioning for both attention and loss functions. We first propose using accurate correspondence between input human pose keypoints to guide attention to relevant regions, such as the speaker's face, hands, and upper body. This input-aware sparse attention reduces redundant computations and strengthens temporal correspondences of body parts, improving inference efficiency and motion coherence. To further enhance visual quality, we introduce an input-aware distillation loss that improves lip synchronization and hand motion realism. By integrating our input-aware sparse attention and distillation loss, our method achieves real-time performance with improved visual quality compared to recent audio-driven and input-driven methods. We also conduct extensive experiments showing the effectiveness of our algorithmic design choices.
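The idea of a pose-conditioned, region-focused loss can be illustrated with a small sketch. The abstract does not give the loss formula, so the code below swaps in a plain weighted mean-squared error between student and teacher outputs, with larger weights on patches that contain face or hand keypoints. The function names, the weight values, and the patch grid are all hypothetical and chosen only for illustration.

```python
def region_weight_map(pose, keypoint_weight, grid, frame_size, base=1.0):
    """Per-patch weights: `base` everywhere, raised where a keypoint lands.

    pose            -- list of (x, y) keypoints in pixel coordinates
    keypoint_weight -- one weight per keypoint (assumed, e.g. face > hand > base)
    grid            -- (rows, cols) of the patch grid
    frame_size      -- (width, height) of the frame in pixels
    """
    rows, cols = grid
    width, height = frame_size
    weights = [base] * (rows * cols)
    for (x, y), w in zip(pose, keypoint_weight):
        col = min(int(x / width * cols), cols - 1)
        row = min(int(y / height * rows), rows - 1)
        idx = row * cols + col
        weights[idx] = max(weights[idx], w)
    return weights


def weighted_distillation_loss(student, teacher, weights):
    """Weighted MSE between per-patch student and teacher values.

    This is a stand-in for the paper's loss, not its actual formulation.
    """
    total = sum(w * (s - t) ** 2 for s, t, w in zip(student, teacher, weights))
    return total / sum(weights)


# Toy 2x2 patch grid on a 4x4 frame: a face keypoint (weight 4) and a hand
# keypoint (weight 2); the remaining patches keep the base weight 1.
weights = region_weight_map([(1, 1), (3, 3)], [4.0, 2.0], (2, 2), (4, 4))
teacher = [0.0, 0.0, 0.0, 0.0]
face_error = weighted_distillation_loss([1.0, 0.0, 0.0, 0.0], teacher, weights)
background_error = weighted_distillation_loss([0.0, 1.0, 0.0, 0.0], teacher, weights)
```

Under these assumed weights, the same unit error costs four times as much on the face patch as on a background patch, which is the intended effect of focusing the loss on lips and hands.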

Paper Structure

This paper contains 34 sections, 6 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: Our two-stage co-speech video generation pipeline. In Stage 1, an audio input and a reference image are fed into an audio-to-motion generator liu2023emage to produce motion sequences represented by dense pose keypoints. In Stage 2, these motion sequences are fed into our efficient student video generation network $G_{\theta}$. The network is conditioned on features from the reference image, which are separately encoded by the VAE encoder rombach2022high and the CLIP encoder radford2021learning, to synthesize the final video. Our work focuses on accelerating video generation, achieving a significant speedup over the teacher model zhang2024mimicmotion.
  • Figure 2: Input-Aware Sparse Attention. Our attention mechanism selectively focuses on tokens within salient body regions and their corresponding areas in temporally relevant frames. (a) We first apply global masking, which restricts attention to the $K$ most similar past frames based on pose similarity. (b) Then local masking limits inter-frame attention to matched regions (e.g., face, hands) to enhance temporal coherence. (c) Our input-aware attention masking integrates both global and local masks to form an efficient and structured sparse attention pattern.
  • Figure 3: Qualitative comparison of audio-driven methods. We show body animation results conditioned on the same input audio. Our method not only achieves accurate lip synchronization but also generates clearer hand gestures.
  • Figure 4: Qualitative comparison of pose-driven methods. All methods are conditioned on the same motion sequence and reference image. Our approach produces more realistic faces, hands, and body movements.
  • Figure 5: Additional Qualitative Results. This figure presents a selection of further examples generated by our method. Given only a single static reference image and an input audio clip, our model effectively synthesizes highly realistic and expressive video outputs. These results visually demonstrate its capability to produce natural facial expressions, fluid body movements, and accurate lip synchronization in real time.
  • ...and 3 more figures
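
The masking scheme described in the Figure 2 caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: pose similarity is taken to be the mean Euclidean distance between keypoints, tokens are patches on a regular grid, a token's region is the region of any keypoint that falls inside it, and attention within a frame is left dense. All function names and the toy data are hypothetical.

```python
from math import dist


def pose_distance(pose_a, pose_b):
    """Mean Euclidean distance between corresponding keypoints (assumed metric)."""
    return sum(dist(a, b) for a, b in zip(pose_a, pose_b)) / len(pose_a)


def top_k_past_frames(poses, k):
    """Global mask: for each frame, the k earlier frames with the closest pose."""
    selected = []
    for i, pose in enumerate(poses):
        past = sorted(range(i), key=lambda j: pose_distance(pose, poses[j]))
        selected.append(set(past[:k]))
    return selected


def token_regions(pose, keypoint_region, grid, frame_size):
    """Label each patch token with a region id; 0 means background."""
    rows, cols = grid
    width, height = frame_size
    labels = [0] * (rows * cols)
    for (x, y), region in zip(pose, keypoint_region):
        col = min(int(x / width * cols), cols - 1)
        row = min(int(y / height * rows), rows - 1)
        labels[row * cols + col] = region
    return labels


def build_mask(poses, keypoint_region, grid, frame_size, k):
    """Combine global and local masks into one boolean attention mask."""
    n_tok = grid[0] * grid[1]
    neighbours = top_k_past_frames(poses, k)
    regions = [token_regions(p, keypoint_region, grid, frame_size) for p in poses]
    total = len(poses) * n_tok
    mask = [[False] * total for _ in range(total)]
    for i in range(len(poses)):
        for p in range(n_tok):
            row = mask[i * n_tok + p]
            # Assumption: every token attends to all tokens of its own frame.
            for t in range(n_tok):
                row[i * n_tok + t] = True
            # Local mask: across frames, attend only to the matching region.
            for j in neighbours[i]:
                for t in range(n_tok):
                    if regions[i][p] != 0 and regions[i][p] == regions[j][t]:
                        row[j * n_tok + t] = True
    return mask


# Toy input: 3 frames, 2 keypoints each (region 1 = face, region 2 = hand),
# a 2x2 patch grid on a 4x4 frame, and K = 1.
poses = [
    [(1, 1), (3, 3)],
    [(3, 1), (1, 3)],
    [(1, 1), (3, 3.2)],
]
neighbours = top_k_past_frames(poses, 1)
mask = build_mask(poses, [1, 2], (2, 2), (4, 4), 1)
allowed = sum(sum(row) for row in mask)
```

In this toy case, frame 2 selects frame 0 rather than the more recent frame 1 because its pose is closer, and its face token attends only to the face token of frame 0. The resulting mask allows 52 of the 144 possible token pairs, which shows where the savings over dense attention come from.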