AV-EmoDialog: Chat with Audio-Visual Users Leveraging Emotional Cues

Se Jin Park, Yeonju Kim, Hyeongseop Rha, Bella Godiva, Yong Man Ro

TL;DR

AV-EmoDialog addresses emotion-aware dialogue by processing audio-visual inputs directly rather than relying on text-only transcriptions. Training proceeds in three stages: a speech encoder is trained to extract verbal cues, a face encoder is trained to extract nonverbal cues, and the LLM is fine-tuned end to end with LoRA to integrate both, optimizing a dialogue objective $L_{dialog}$ over $R$ rounds. Experiments on MultiDialog show that AV-EmoDialog achieves higher EmoBERT scores and semantic metrics than the comparison methods, and scores highly in GPT-4 and human evaluations of fluency, empathy, and emotional context. The results demonstrate the importance of fusing audio and visual cues for more natural, emotionally resonant interactions, with potential extensions to more diverse datasets and end-to-end speech generation.

Abstract

In human communication, both verbal and non-verbal cues play a crucial role in conveying emotions, intentions, and meaning beyond words alone. Non-linguistic information, such as facial expressions, eye contact, voice tone, and pitch, is fundamental to effective interaction, enriching conversations by adding emotional and contextual depth. Recognizing the importance of non-linguistic content in communication, we present AV-EmoDialog, a dialogue system designed to exploit verbal and non-verbal information from users' audio-visual inputs to generate more responsive and empathetic interactions. AV-EmoDialog systematically exploits the emotional cues in audio-visual dialogues: it extracts speech content and emotional tones from speech, analyzes fine-grained facial expressions from visuals, and integrates these cues to generate emotionally aware responses in an end-to-end manner. Through extensive experiments, we validate that the proposed AV-EmoDialog outperforms existing multimodal LLMs in generating responses that are both emotionally and contextually appropriate.

Paper Structure

This paper contains 26 sections, 1 equation, 15 figures, 5 tables.

Figures (15)

  • Figure 1: Overview of the proposed method, where (a) the speech encoder is trained to extract verbal cues that can be understood by the LLM, (b) the face encoder is trained to extract nonverbal cues through the LLM, and (c) the LLM is fine-tuned to integrate the verbal and nonverbal cues from the respective encoders to generate contextually and emotionally appropriate responses.
  • Figure 2: Hard prompts given at each stage of training. (a) The speech understanding prompt enables the speech encoder to extract the verbal cues (both linguistic and non-linguistic information) from the audio input. (b) The face video understanding prompt allows the face encoder to extract the nonverbal cues in the face video input. (c) The audio-visual dialogue prompt guides AV-EmoDialog to generate contextually and emotionally appropriate responses given the speech embedding and the visual embedding of the audio-visual user input. The metadata description is adjusted based on what extra information is available in the dataset, such as emotion, emotion intensity, emotion description, age, and ethnicity.
  • Figure 3: Examples of emotion-relevant descriptions from face videos generated by GPT-4. The annotated descriptions provide detailed facial expressions and their dynamic changes over time, guiding the model to learn subtle, non-static emotions during the conversation.
  • Figure 4: Generation results from our AV-EmoDialog, which processes audio-visual input from users and outputs textual responses. For clarity, the transcriptions of the audio-visual inputs are provided alongside. Below them, we show the responses generated by the comparison methods.
  • Figure 5: Facial descriptions generated for training using GPT.
  • ...and 10 more figures
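
The Figure 2 caption notes that the metadata description in the dialogue prompt is adjusted to whatever extra information a dataset provides. A minimal sketch of that idea follows. It is an illustration only: the field key names, the prompt wording, and the `<speech>` and `<face>` placeholder markers are all assumptions, not the paper's actual prompt or tokens.

```python
from typing import Mapping, Optional

# Metadata fields named in the Figure 2 caption; the key names are assumptions.
METADATA_FIELDS = (
    "emotion",
    "emotion_intensity",
    "emotion_description",
    "age",
    "ethnicity",
)


def build_metadata_description(sample: Mapping[str, Optional[str]]) -> str:
    """Describe only the metadata fields that are present for this sample."""
    parts = []
    for field in METADATA_FIELDS:
        value = sample.get(field)
        if value:  # skip fields that are missing or empty in this dataset
            parts.append(f"{field.replace('_', ' ')}: {value}")
    return "; ".join(parts)


def build_dialogue_prompt(sample: Mapping[str, Optional[str]]) -> str:
    """Assemble a hypothetical audio-visual dialogue prompt.

    <speech> and <face> mark where the speech and visual embeddings would be
    spliced in; they are illustrative placeholders, not the paper's tokens.
    """
    lines = ["User speech: <speech>", "User face video: <face>"]
    metadata = build_metadata_description(sample)
    if metadata:  # omit the metadata line entirely when nothing is available
        lines.append(f"User metadata: {metadata}")
    lines.append("Respond in a contextually and emotionally appropriate way.")
    return "\n".join(lines)
```

For example, a sample that carries only `emotion` and `age` yields a prompt whose metadata line lists just those two fields, while a sample with no metadata yields a prompt with no metadata line at all.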