TalkingMachines: Real-Time Audio-Driven FaceTime-Style Video via Autoregressive Diffusion Models
Chetwin Low, Weimin Wang
TL;DR
TalkingMachines reimagines real-time avatar generation by converting a bidirectional video diffusion model into an autoregressive, audio-driven system. The core approach combines Flow Matching-based conditioning with Distribution Matching Distillation, distilling generation down to just 2 diffusion steps to enable infinite-length, low-latency video streams. System-level optimizations, including Score-VAE disaggregation, overlap of inter-device communication with computation, and memory caching, enable real-time FaceTime-style avatars with lip sync and multi-style support. The work demonstrates practical real-time deployment for interactive conversations and highlights pathways toward future scale and cross-domain generality.
Abstract
In this paper, we present TalkingMachines, an efficient framework that transforms pretrained video generation models into real-time, audio-driven character animators. TalkingMachines enables natural conversational experiences by integrating an audio large language model (LLM) with our video generation foundation model. Our primary contributions are: (1) We adapt a pretrained SOTA image-to-video DiT into an audio-driven avatar generation model of 18 billion parameters; (2) We enable infinite video streaming without error accumulation through asymmetric knowledge distillation from a bidirectional teacher model into a sparse causal, autoregressive student model; (3) We design a high-throughput, low-latency inference pipeline incorporating several key engineering optimizations: (a) disaggregation of the DiT and VAE decoder across separate devices, (b) efficient overlap of inter-device communication and computation using CUDA streams, and (c) elimination of redundant recomputations to maximize frame-generation throughput. Demo videos are available at https://aaxwaz.github.io/TalkingMachines/
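
To make the idea behind optimization (a) concrete, the following is a minimal CPU-only sketch of a two-stage pipeline in which one worker produces latents while another decodes the previous chunk. It is not the authors' implementation: `denoise_chunk`, `decode_chunk`, `run_pipeline`, and the `depth` parameter are hypothetical stand-ins, Python threads and a bounded queue take the place of separate GPUs and CUDA streams, and the sleeps only simulate work.

```python
import queue
import threading
import time


def denoise_chunk(i):
    """Hypothetical stand-in for the DiT producing latents for chunk i."""
    time.sleep(0.02)  # simulated compute, not a real model call
    return f"latent-{i}"


def decode_chunk(latent):
    """Hypothetical stand-in for the VAE decoder turning latents into frames."""
    time.sleep(0.02)  # simulated compute, not a real model call
    return latent.replace("latent", "frames")


def run_pipeline(num_chunks, depth=2):
    """Run both stages concurrently; return decoded chunks in order."""
    # Bounded queue: the producer blocks once `depth` chunks are waiting,
    # so the denoising stage cannot run arbitrarily far ahead.
    handoff = queue.Queue(maxsize=depth)
    frames = []

    def dit_worker():
        for i in range(num_chunks):
            handoff.put(denoise_chunk(i))
        handoff.put(None)  # sentinel: no more chunks

    def vae_worker():
        while True:
            latent = handoff.get()
            if latent is None:
                break
            frames.append(decode_chunk(latent))

    workers = [
        threading.Thread(target=dit_worker),
        threading.Thread(target=vae_worker),
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return frames


if __name__ == "__main__":
    print(run_pipeline(4))
```

Because the two stages run concurrently, decoding chunk i can proceed while chunk i+1 is being denoised, which is the general motivation for placing the two models on separate devices. A single queue with a single consumer also preserves chunk order, which matters for a continuous video stream.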
