Warm Chat: Diffuse Emotion-aware Interactive Talking Head Avatar with Tree-Structured Guidance

Haijie Yang, Zhenyu Zhang, Hao Tang, Jianjun Qian, Jian Yang

TL;DR

Warm Chat addresses the lack of emotion-aware bidirectional talking heads by combining a diffusion-based generator with an Interactive Talking Tree and a Transformer-based head-mask generator. It uses GPT-4 to generate open-ended dialogue, a fine-tuned audio-to-expression model for speaker motion, and a Listener Emotion-Expression Dictionary for listener motion. The method supports seamless speaker/listener role switching and accumulates emotional context via reverse-level traversal of the tree, producing temporally coherent, emotionally expressive avatars. Experiments on ViCo and ViCoX report improved visual quality, lip-sync, and emotional coherence over baselines, suggesting potential for natural, long-running avatar conversations.

Abstract

Generative models have advanced rapidly, enabling impressive talking head generation that brings AI to life. However, most existing methods focus solely on one-way portrait animation. Even the few that support bidirectional conversational interactions lack precise emotion-adaptive capabilities, significantly limiting their practical applicability. In this paper, we propose Warm Chat, a novel emotion-aware talking head generation framework for dyadic interactions. Leveraging the dialogue generation capability of large language models (LLMs, e.g., GPT-4), our method produces temporally consistent virtual avatars with rich emotional variations that seamlessly transition between speaking and listening states. Specifically, we design a Transformer-based head mask generator that learns temporally consistent motion features in a latent mask space and can generate mask sequences of arbitrary length to constrain head motions. Furthermore, we introduce an interactive talking tree structure to represent dialogue state transitions, where each tree node contains information such as its child, parent, and sibling nodes and the current character's emotional state. By performing reverse-level traversal from the current node, we extract rich historical emotional cues to guide expression synthesis. Extensive experiments demonstrate the superior performance and effectiveness of our method.
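
The tree-structured guidance described in the abstract can be illustrated with a minimal sketch. The names `TalkNode` and `cumulative_emotion`, the field layout, and the exponential `decay` weighting are assumptions made for illustration; the text above only states that each node holds links to related nodes plus an emotional state, and that history is gathered by traversing back from the current node.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(eq=False)
class TalkNode:
    """One dialogue turn. Field names are illustrative, not from the paper."""

    role: str                                   # "speaker" or "listener"
    emotion: str                                # emotion label for this turn
    parent: TalkNode | None = field(default=None, repr=False)
    children: list[TalkNode] = field(default_factory=list)

    def add_child(self, role: str, emotion: str) -> TalkNode:
        """Attach the next dialogue turn below this node and return it."""
        child = TalkNode(role=role, emotion=emotion, parent=self)
        self.children.append(child)
        return child

    def siblings(self) -> list[TalkNode]:
        """Other nodes that share this node's parent."""
        if self.parent is None:
            return []
        return [n for n in self.parent.children if n is not self]


def cumulative_emotion(node: TalkNode, decay: float = 0.5) -> dict[str, float]:
    """Walk from `node` up to the root, weighting older turns less.

    The exponential decay is an assumed weighting scheme, chosen only to
    show how historical cues could be combined into one distribution.
    """
    scores: dict[str, float] = {}
    weight = 1.0
    current: TalkNode | None = node
    while current is not None:
        scores[current.emotion] = scores.get(current.emotion, 0.0) + weight
        weight *= decay
        current = current.parent
    total = sum(scores.values())
    return {label: value / total for label, value in scores.items()}


root = TalkNode(role="speaker", emotion="neutral")
reply = root.add_child(role="listener", emotion="happy")
latest = reply.add_child(role="speaker", emotion="happy")
history = cumulative_emotion(latest)
```

Here `history` gives the most weight to the label that dominates recent turns, which is one plausible way for a cumulative emotional label to guide expression synthesis.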

Paper Structure

This paper contains 14 sections, 12 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Overview. First, we construct an Interactive Talking Tree (ITT) to represent the dynamic states of the dialogue throughout the modeling process. By performing a reverse hierarchical traversal with weighted operations on the ITT, we derive the cumulative emotional label at the current node. Additionally, we fine-tune the pre-trained audio-to-expression model to obtain the speaker's facial motion. Next, we introduce a Consistent Random Head Mask Generator (CRHMG) to regulate the head motion of both the speaker and the listener. We also develop a Listener Emotion-Expression Dictionary (LEED) that maps emotional labels to plausible facial expressions, which are then translated into corresponding facial motion. Finally, we design a diffusion model conditioned on both facial and head motion to generate realistic responses for the speaker and listener. Leveraging a large language model (GPT-4), the system supports continuous, open-ended dialogue between the two parties.
  • Figure 2: Architecture of the Consistent Random Head Mask Generator (CRHMG).
  • Figure 3: Architecture of the Interactive Talking Tree (ITT).
  • Figure 4: Comparison with other approaches across three scenarios: talking sequence reconstruction (a), human-virtual agent interaction (b), and virtual agent-virtual agent dialogue (c). The results show that our method achieves better image quality, audio-lip synchronization, and emotional expressiveness. Notably, because Sonic and DiffusionRig were not originally designed for interactive dialogue scenarios, we manually implemented speaker-listener role switching for them to enable the comparison.
  • Figure 5: Results of diverse generated head motions.
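
The Listener Emotion-Expression Dictionary from the Figure 1 caption maps an emotion label to plausible facial expressions. A minimal sketch of such a lookup follows. The labels, the expression names, the neutral fallback, and the function `pick_listener_expression` are all hypothetical, since the source does not list the dictionary's contents or how an expression is selected.

```python
import random

# Hypothetical entries: the source does not list the labels or expressions.
LEED: dict[str, list[str]] = {
    "happy": ["smile", "raised_cheeks"],
    "sad": ["frown", "lowered_gaze"],
    "neutral": ["relaxed"],
}


def pick_listener_expression(label: str, rng: random.Random) -> str:
    """Return one plausible expression for `label`.

    Unknown labels fall back to the neutral entry (an assumed policy).
    """
    candidates = LEED.get(label, LEED["neutral"])
    return rng.choice(candidates)


rng = random.Random(0)  # seeded so repeated runs pick the same expressions
chosen = pick_listener_expression("happy", rng)
fallback = pick_listener_expression("surprised", rng)
```

Keeping several candidates per label allows some variation in the listener's reactions, while the seeded generator keeps a run reproducible.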