Towards Multimodal Empathetic Response Generation: A Rich Text-Speech-Vision Avatar-based Benchmark

Han Zhang, Zixiang Meng, Meng Luo, Hong Han, Lizi Liao, Erik Cambria, Hao Fei

TL;DR

Empathetic Response Generation (ERG) has largely relied on text; this work extends it to Multimodal ERG (MERG) with an avatar-based benchmark, AvaMERG, and an end-to-end system, Empatheia. AvaMERG provides 33,048 dialogues and 152,021 multimodal utterances spanning text, speech, and talking-head video, with diverse avatar profiles and topics. Empatheia combines a Multimodal Encoder, a Vicuna-based core reasoner, and two generators (StyleTTS2 and DreamTalk). It is trained end to end with Chain-of-Empathy reasoning, Content Consistency Learning, and Style Alignment and Consistency Learning, and in experiments it outperforms baselines on both text ERG and MERG. By delivering a realistic benchmark and an integrated model, the work aims to enable more natural and emotionally coherent multimodal human-AI interaction.

Abstract

Empathetic Response Generation (ERG) is one of the key tasks in affective computing; it aims to produce emotionally nuanced and compassionate responses to users' queries. However, existing ERG research is predominantly confined to the text modality alone, which limits its effectiveness because human emotions are inherently conveyed through multiple modalities. To address this, we introduce an avatar-based Multimodal ERG (MERG) task, entailing rich text, speech, and facial vision information. We first present a large-scale, high-quality benchmark dataset, AvaMERG, which extends traditional text ERG by incorporating authentic human speech audio and dynamic talking-face avatar videos, encompassing a diverse range of avatar profiles and broadly covering topics from real-world scenarios. Further, we tailor a system, named Empatheia, for MERG. Built upon a Multimodal Large Language Model (MLLM) with a multimodal encoder and speech and avatar generators, Empatheia performs end-to-end MERG, with a Chain-of-Empathetic reasoning mechanism integrated for enhanced empathy understanding and reasoning. Finally, we devise a set of empathy-enhanced tuning strategies that strengthen emotional accuracy as well as content and avatar-profile consistency across modalities. Experimental results on AvaMERG demonstrate that Empatheia consistently outperforms baseline methods on both textual ERG and MERG. Overall, this work is expected to pioneer MERG research by introducing a novel benchmark and an end-to-end model, laying a solid foundation for future advances in multimodal empathetic response generation.

Paper Structure

This paper contains 49 sections, 32 equations, 24 figures, 9 tables.

Figures (24)

  • Figure 1: A snippet of avatar-based Multimodal Empathetic Response Generation (MERG) with rich multimodal signals: text (dialogue), audio (acoustic speech) and vision (dynamic talking-head avatar).
  • Figure 2: Visualized statistics of AvaMERG dataset.
  • Figure 3: Architecture of our Empatheia MLLM for MERG.
  • Figure 4: Illustration of the Content Synchronizer and Style Disentangle modules.
  • Figure 5: Illustrations of the proposed training strategies.
  • ...and 19 more figures