AV-Flow: Transforming Text to Audio-Visual Human-like Interactions

Aggelina Chatziagapi, Louis-Philippe Morency, Hongyu Gong, Michael Zollhoefer, Dimitris Samaras, Alexander Richard

TL;DR

AV-Flow presents a text-driven approach to synthesize synchronized speech and 4D facial/head motion for photo-realistic avatars using two interconnected diffusion transformers trained with flow matching. The method enables both monadic and empathetic dyadic interactions by conditioning on user audio-visual input and introducing cross-modal highway fusion to tightly couple speech and motion. It achieves state-of-the-art performance on lip-sync, realism, and audio-visual alignment while delivering fast inference and a text-to-tokens path for end-to-end text input. This work advances natural human-AI interaction by delivering an always-on, listening avatar capable of active response in conversations, with careful attention to ethical use and consent.

Abstract

We introduce AV-Flow, an audio-visual generative model that animates photo-realistic 4D talking avatars given only text input. In contrast to prior work that assumes an existing speech signal, we synthesize speech and vision jointly. We demonstrate human-like speech synthesis, synchronized lip motion, lively facial expressions, and head pose, all generated from just text characters. The core premise of our approach lies in the architecture of our two parallel diffusion transformers. Intermediate highway connections ensure communication between the audio and visual modalities, and thus synchronized speech intonation and facial dynamics (e.g., eyebrow motion). Our model is trained with flow matching, leading to expressive results and fast inference. In the case of dyadic conversations, AV-Flow produces an always-on avatar that actively listens and reacts to the audio-visual input of a user. Through extensive experiments, we show that our method outperforms prior work, synthesizing natural-looking 4D talking avatars. Project page: https://aggelinacha.github.io/AV-Flow/
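
The abstract names flow matching as the training objective without giving its form here. As a rough illustration only, the sketch below shows one common formulation (a straight-line path from noise to data with a constant target velocity, integrated with Euler steps at inference). The path choice, the function names `cfm_training_pair`, `cfm_loss`, and `euler_sample`, and the feature shapes are all assumptions for illustration, not the authors' implementation.

```python
import numpy as np


def cfm_training_pair(x1, rng):
    """Build one flow matching training example (assumed linear path).

    x1 is a data sample, e.g. a (frames, features) array of motion or
    audio features. Returns the sampled time t, the interpolated point
    x_t, and the velocity the network would be asked to regress.
    """
    x0 = rng.standard_normal(x1.shape)   # noise endpoint of the path
    t = rng.uniform()                    # time in [0, 1)
    x_t = (1.0 - t) * x0 + t * x1        # point on the straight path
    v_target = x1 - x0                   # constant velocity along it
    return t, x_t, v_target


def cfm_loss(v_pred, v_target):
    """Mean squared error between predicted and target velocity."""
    return float(np.mean((v_pred - v_target) ** 2))


def euler_sample(velocity_fn, shape, steps, rng):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = rng.standard_normal(shape)       # start from noise
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t)
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x1 = rng.standard_normal((4, 3))     # stand-in for real features
    t, x_t, v = cfm_training_pair(x1, rng)
    # A perfect prediction gives zero loss.
    print(cfm_loss(v, v))
```

In a real system `velocity_fn` would be the trained network (here, presumably the two transformers conditioned on text); a small number of Euler steps is what makes sampling with this kind of objective comparatively fast.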

Paper Structure

This paper contains 16 sections, 6 equations, 9 figures, 3 tables.

Figures (9)

  • Figure 1: We present AV-Flow, a novel method for joint audio-visual generation of 4D talking avatars, given text input only (e.g. obtained from an LLM). Inter-connected diffusion transformers ensure cross-modal communication, synthesizing synchronized speech, facial motion, and head motion, based on the flow matching objective. AV-Flow further enables empathetic dyadic interactions, by animating an always-on avatar that actively listens and reacts to the audio-visual input of a user.
  • Figure 2: Overview of AV-Flow. Given any input text, our method synthesizes expressive audio-visual 4D talking avatars, jointly generating head and facial dynamics and the corresponding speech signal. Two parallel diffusion transformers with intermediate highway connections ensure communication between the audio and visual modalities. AV-Flow can be additionally conditioned on the audio-visual input of a user, in order to synthesize conversational avatars in dyadic interactions.
  • Figure 3: Qualitative Results of AV-Flow. From just raw text characters as input, AV-Flow synthesizes an expressive audio signal (shown as a mel-spectrogram on top) and the corresponding head and facial dynamics of our 4D talking avatar.
  • Figure 4: Qualitative Evaluation. We compare with state-of-the-art methods for audio-driven talking faces, namely FaceTalk [aneja2023facetalk], VASA-1 [xu2024vasa], and Audio2Photoreal [ng2024audio2photoreal], and with the text-driven TTSF [Jang_2024_CVPR] (the only one that can generate speech from text like ours). We re-implement VASA-1 and TTSF (denoted with an asterisk) for our data (face encodings and renderers). FaceTalk only animates the face (not head motion). Our proposed AV-Flow synthesizes the corresponding phoneme (shown on top) more accurately.
  • Figure 5: Audio-Visual Guidance in Dyadic Conversations. The actor reacts (with their gaze or smile) according to the participant's expression and/or voice (AV-Flow with guidance).
  • ...and 4 more figures