INFP: Audio-Driven Interactive Head Generation in Dyadic Conversations

Yongming Zhu, Longhao Zhang, Zhengkun Rong, Tianshu Hu, Shuang Liang, Zhipeng Ge

TL;DR

The paper tackles the challenge of audio-driven head generation in dyadic conversations by enabling a driven agent to fluidly switch between speaking and listening without predefined roles. It introduces INFP, a two-stage framework: Motion-Based Head Imitation learns a disentangled motion latent space from real conversations to animate a portrait, and Audio-Guided Motion Generation maps dual-track dyadic audio to this latent space via an interactive motion guider and a lightweight conditional diffusion transformer. A new DyConv dataset of over 200 hours of authentic dyadic interaction is presented to support scalable training and evaluation. Empirical results across interactive, listening, and talking head tasks show significant improvements over state-of-the-art baselines in visual quality, lip-sync accuracy, and motion diversity, validated by both quantitative metrics and user studies. The work enables real-time, person-generic interactive head generation suitable for applications like video conferencing and provides a public dataset to spur further research in dyadic visual communication.

Abstract

Imagine having a conversation with a socially intelligent agent. It can attentively listen to your words and offer visual and linguistic feedback promptly. This seamless interaction allows multiple rounds of conversation to flow smoothly and naturally. In pursuit of this goal, we propose INFP, a novel audio-driven head generation framework for dyadic interaction. Unlike previous head generation works that focus only on single-sided communication, or that require manual role assignment and explicit role switching, our model drives the agent portrait to alternate dynamically between speaking and listening states, guided by the input dyadic audio. Specifically, INFP comprises a Motion-Based Head Imitation stage and an Audio-Guided Motion Generation stage. The first stage learns to project facial communicative behaviors from real-life conversation videos into a low-dimensional motion latent space, and uses the motion latent codes to animate a static image. The second stage learns the mapping from the input dyadic audio to motion latent codes through denoising, enabling audio-driven head generation in interactive scenarios. To facilitate this line of research, we introduce DyConv, a large-scale dataset of rich dyadic conversations collected from the Internet. Extensive experiments and visualizations demonstrate the superior performance and effectiveness of our method. Project Page: https://grisoon.github.io/INFP/.
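The data flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: every function name, the latent size, the fusion weights, and the update rule below are hypothetical placeholders chosen only to show the order of operations (dual-track audio, then a fused condition, then iterative denoising into a motion latent code, then animation of a static portrait). In the real system these placeholders would be learned networks.

```python
import random
from typing import List

Vector = List[float]
Image = List[List[float]]

LATENT_DIM = 8   # hypothetical size of the motion latent code
NUM_STEPS = 4    # hypothetical number of denoising steps


def encode_audio(track: List[float], dim: int = LATENT_DIM) -> Vector:
    """Toy audio encoder (assumption): mean of each waveform chunk."""
    chunk = max(1, len(track) // dim)
    feats = []
    for i in range(dim):
        window = track[i * chunk:(i + 1) * chunk] or [0.0]
        feats.append(sum(window) / len(window))
    return feats


def guide(agent_feat: Vector, partner_feat: Vector) -> Vector:
    """Toy stand-in for fusing both audio tracks into one condition."""
    # The 0.5 weight is arbitrary; a learned module would replace this.
    return [a + 0.5 * p for a, p in zip(agent_feat, partner_feat)]


def denoise(condition: Vector, steps: int = NUM_STEPS, seed: int = 0) -> Vector:
    """Toy denoiser: start from Gaussian noise, step toward the condition."""
    rng = random.Random(seed)
    latent = [rng.gauss(0.0, 1.0) for _ in condition]
    for _ in range(steps):
        # Placeholder update: halve the gap to the condition each step.
        latent = [z + 0.5 * (c - z) for z, c in zip(latent, condition)]
    return latent


def animate(portrait: Image, latent: Vector) -> Image:
    """Toy renderer: shift every pixel by the mean of the latent code."""
    offset = sum(latent) / len(latent)
    return [[px + offset for px in row] for row in portrait]


def generate_frame(portrait: Image, agent_audio: List[float],
                   partner_audio: List[float]) -> Image:
    """Audio-to-frame pipeline: encode, fuse, denoise, animate."""
    condition = guide(encode_audio(agent_audio), encode_audio(partner_audio))
    return animate(portrait, denoise(condition))


if __name__ == "__main__":
    portrait = [[0.0, 1.0], [2.0, 3.0]]
    frame = generate_frame(portrait, [0.1] * 16, [0.2] * 16)
    print(len(frame), len(frame[0]))
```

Note that no speaker or listener flag is passed anywhere: in this sketch, as in the framework described above, the behavior is conditioned only on the two audio tracks.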

Paper Structure

This paper contains 17 sections, 1 equation, 5 figures, 4 tables.

Figures (5)

  • Figure 1: We present INFP, an audio-driven interactive head generation framework for dyadic conversations. Given the dual-track audio of a dyadic conversation and a single portrait image of an arbitrary agent, our framework can dynamically synthesize verbal, non-verbal, and interactive agent videos with lifelike facial expressions and rhythmic head pose movements. Additionally, our framework is lightweight yet powerful, making it practical in instant communication scenarios such as video conferencing. The name INFP denotes that our method is Interactive, Natural, Flash and Person-generic.
  • Figure 2: Objective illustration. Existing interactive head generation (left) applies manual role assignment and explicit role switching. Our proposed INFP (right) is a unified framework that can dynamically and naturally adapt to various conversational states.
  • Figure 3: Overview of INFP. The first stage (Motion-Based Head Imitation) learns to project facial communicative behaviors from real-life conversation videos into a low-dimensional motion latent space, and uses the latent codes to animate a static image. The second stage (Audio-Guided Motion Generation) learns the mapping from the input dyadic audio to motion latent codes through denoising, to achieve audio-driven interactive head generation.
  • Figure 4: Qualitative comparison of interactive head generation on DyConv. Given the input audio of dyadic conversations (the audio of the predicted agent is shown in black and the audio of the conversation partner in blue), our framework generates reasonable and vivid facial and head pose movements across the various roles in interactive scenarios.
  • Figure 5: Visualization of style control with our style modulation.