TalkingMachines: Real-Time Audio-Driven FaceTime-Style Video via Autoregressive Diffusion Models

Chetwin Low, Weimin Wang

TL;DR

TalkingMachines reimagines real-time avatar generation by converting a bidirectional video diffusion model into an autoregressive, audio-driven system. The core approach combines Flow Matching-based conditioning with Distribution Matching Distillation (DMD) to enable infinite-length, low-latency video streams using a model distilled down to just 2 diffusion steps. System-level optimizations, including Score-VAE disaggregation, overlap of inter-device communication with computation, and memory caching, enable real-time FaceTime-style avatars with lip sync and multi-style support. The work demonstrates practical real-time deployment for interactive conversations and highlights pathways for future scale and cross-domain generality.

Abstract

In this paper, we present TalkingMachines, an efficient framework that transforms pretrained video generation models into real-time, audio-driven character animators. TalkingMachines enables natural conversational experiences by integrating an audio large language model (LLM) with our video generation foundation model. Our primary contributions include: (1) We adapt a pretrained SOTA image-to-video DiT into an audio-driven avatar generation model of 18 billion parameters; (2) We enable infinite video streaming without error accumulation through asymmetric knowledge distillation from a bidirectional teacher model into a sparse causal, autoregressive student model; (3) We design a high-throughput, low-latency inference pipeline incorporating several key engineering optimizations, such as (a) disaggregation of the DiT and VAE decoder across separate devices, (b) efficient overlap of inter-device communication and computation using CUDA streams, and (c) elimination of redundant recomputations to maximize frame-generation throughput. Demo videos are available at https://aaxwaz.github.io/TalkingMachines/
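
The disaggregation idea in contribution (3) can be illustrated with a small CPU-only analogy. This is a rough sketch under stated assumptions, not the authors' pipeline: two Python threads stand in for the two devices, a bounded queue stands in for the inter-device transfer, and `denoise_chunk`, `decode_latent`, and `run_pipeline` are hypothetical placeholder names. It only shows the structural point that the decoder can work on one chunk while the sampler is already producing the next.

```python
import queue
import threading


def denoise_chunk(chunk_index: int) -> str:
    """Hypothetical stand-in for the few-step DiT sampler on device A."""
    return f"latent-{chunk_index}"


def decode_latent(latent: str) -> str:
    """Hypothetical stand-in for the VAE decoder on device B."""
    return latent.replace("latent", "frames")


def run_pipeline(num_chunks: int, max_in_flight: int = 2) -> list[str]:
    """Run both stages concurrently and return decoded chunks in order."""
    # Bounded queue: the sampler blocks if the decoder falls too far behind.
    handoff: queue.Queue = queue.Queue(maxsize=max_in_flight)
    decoded: list[str] = []

    def decoder_worker() -> None:
        while True:
            latent = handoff.get()
            if latent is None:  # sentinel: no more chunks will arrive
                break
            decoded.append(decode_latent(latent))

    worker = threading.Thread(target=decoder_worker)
    worker.start()
    for i in range(num_chunks):
        # While the worker decodes chunk i-1, this thread produces chunk i.
        handoff.put(denoise_chunk(i))
    handoff.put(None)
    worker.join()
    return decoded
```

Calling `run_pipeline(3)` returns the decoded chunks in generation order. A real system would replace the queue with device-to-device copies issued on a separate CUDA stream, which this sketch does not attempt to model.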

Paper Structure

This paper contains 23 sections, 7 equations, 3 figures, 1 table.

Figures (3)

  • Figure 1: TalkingMachines provides a framework to generate highly dynamic, immersive FaceTime experiences based on different character styles.
  • Figure 2: Overview of the DMD training workflow for TalkingMachines. The diagram illustrates the asymmetric distillation process where a bidirectional teacher model is distilled into an autoregressive student model. The training incorporates mixed data generation with synthetic samples from the student model, sparse causal attention patterns across chunks, and a combination of DMD loss with regression loss for stable convergence.
  • Figure 3: Runtime analysis comparing the latency of server designs: a simple self-contained server and our Score-VAE disaggregation server, each with and without Sequence Parallelism.
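
The "sparse causal attention patterns across chunks" mentioned in the Figure 2 caption can be made concrete with a small mask-building sketch. The exact pattern used by TalkingMachines is not specified in this summary, so the rule below is an assumption chosen for illustration: frames attend bidirectionally within their own chunk, may also see a fixed number of preceding chunks (`window`), and never see later chunks. The function name `chunk_causal_mask` and its parameters are hypothetical.

```python
import numpy as np


def chunk_causal_mask(num_frames: int, chunk_size: int, window: int = 1) -> np.ndarray:
    """Return a boolean mask where mask[i, j] is True if frame i may attend to frame j.

    Assumed rule (illustrative, not taken from the paper):
    - frames are grouped into consecutive chunks of `chunk_size`;
    - attention inside a chunk is bidirectional;
    - a chunk also sees the `window` chunks immediately before it;
    - no chunk sees any later chunk.
    """
    if num_frames <= 0 or chunk_size <= 0 or window < 0:
        raise ValueError("num_frames and chunk_size must be positive, window non-negative")
    chunk_id = np.arange(num_frames) // chunk_size
    # Query chunk index minus key chunk index, for every (query, key) pair.
    diff = chunk_id[:, None] - chunk_id[None, :]
    return (diff >= 0) & (diff <= window)


mask = chunk_causal_mask(num_frames=6, chunk_size=2, window=1)
print(mask.astype(int))
```

With 6 frames in chunks of 2 and `window=1`, frames in the third chunk can attend to the second and third chunks but not the first, which is what keeps the attention cost bounded as the stream grows.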