Warm Chat: Diffuse Emotion-aware Interactive Talking Head Avatar with Tree-Structured Guidance

Haijie Yang; Zhenyu Zhang; Hao Tang; Jianjun Qian; Jian Yang

TL;DR

Warm Chat tackles the lack of emotion-aware bidirectional talking heads by integrating a diffusion-based generator with an Interactive Talking Tree and a Transformer-based head-mask generator. It uses GPT-4 to generate open-ended dialogue and a fine-tuned Audio-to-Expression model for speaker motion, along with a Listener Emotion Expression Dictionary for the listener. The method supports seamless speaker/listener role switching and accumulative emotional context via reverse-level traversal, producing temporally coherent, emotionally expressive avatars. Experiments on ViCo and ViCoX demonstrate improved visual quality, lip-sync, and emotional coherence compared to baselines, highlighting potential for natural, long-running avatar conversations.

Abstract

Generative models have advanced rapidly, enabling impressive talking head generation that brings AI to life. However, most existing methods focus solely on one-way portrait animation. Even the few that support bidirectional conversational interaction lack precise emotion-adaptive capabilities, which significantly limits their practical applicability. In this paper, we propose Warm Chat, a novel emotion-aware talking head generation framework for dyadic interactions. Leveraging the dialogue generation capability of large language models (LLMs, e.g., GPT-4), our method produces temporally consistent virtual avatars with rich emotional variations that transition seamlessly between speaking and listening states. Specifically, we design a Transformer-based head mask generator that learns temporally consistent motion features in a latent mask space and can generate mask sequences of arbitrary length to constrain head motion. Furthermore, we introduce an interactive talking tree structure to represent dialogue state transitions, where each tree node stores its child, parent, and sibling nodes together with the current character's emotional state. By performing reverse-level traversal from the current node, we extract rich historical emotional cues to guide expression synthesis. Extensive experiments demonstrate the superior performance and effectiveness of our method.
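
To make the tree-structured guidance more concrete, the following is a minimal sketch of a dialogue tree whose nodes carry an emotional state, together with an upward walk that gathers emotional history. It is not the authors' implementation: the names `TalkNode`, `add_child`, `siblings`, and `emotion_history` are hypothetical, and the traversal shown (visit the current node's level, then each ancestor's level up to the root) is only one plausible reading of "reverse-level traversal".

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TalkNode:
    """One dialogue turn. All field names are illustrative assumptions."""
    role: str      # "speaker" or "listener"
    emotion: str   # e.g. "neutral", "happy"
    # Exclude the parent link from repr/eq to avoid infinite recursion.
    parent: Optional["TalkNode"] = field(default=None, repr=False, compare=False)
    children: List["TalkNode"] = field(default_factory=list)

    def add_child(self, role: str, emotion: str) -> "TalkNode":
        """Attach a new turn below this one and return it."""
        child = TalkNode(role, emotion, parent=self)
        self.children.append(child)
        return child

    def siblings(self) -> List["TalkNode"]:
        """Other nodes that share this node's parent."""
        if self.parent is None:
            return []
        return [n for n in self.parent.children if n is not self]


def emotion_history(node: TalkNode) -> List[List[str]]:
    """Collect emotions level by level, from `node` up to the root.

    Each inner list holds the emotion of the node on the path,
    followed by the emotions of its siblings (assumed ordering).
    """
    history = []
    current: Optional[TalkNode] = node
    while current is not None:
        level = [current.emotion] + [s.emotion for s in current.siblings()]
        history.append(level)
        current = current.parent
    return history


if __name__ == "__main__":
    root = TalkNode("speaker", "neutral")
    a = root.add_child("listener", "happy")
    root.add_child("listener", "surprised")
    c = a.add_child("speaker", "sad")
    print(emotion_history(c))
```

In this sketch the most recent emotional context comes first in the returned list, so a downstream expression model could weight nearer turns more heavily; whether Warm Chat orders or aggregates the cues this way is not stated in the abstract.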
