AV-EmoDialog: Chat with Audio-Visual Users Leveraging Emotional Cues
Se Jin Park, Yeonju Kim, Hyeongseop Rha, Bella Godiva, Yong Man Ro
TL;DR
AV-EmoDialog addresses emotion-aware dialogue by processing users' audio-visual inputs directly rather than relying on text-only transcriptions. The approach uses a three-stage training scheme: (1) a speech encoder for verbal cues, (2) a face encoder for non-verbal cues, and (3) end-to-end integration with an LLM using LoRA, optimizing a dialogue objective $L_{dialog}$ over $R$ rounds. In experiments on MultiDialog, AV-EmoDialog outperforms existing multimodal LLMs on EmoBERT scores and semantic metrics, and scores highly in GPT-4 and human evaluations of fluency, empathy, and emotional context. The results demonstrate the importance of fusing audio and visual cues for more natural, emotionally resonant interactions, with potential extensions to more diverse datasets and end-to-end speech generation.
Abstract
In human communication, both verbal and non-verbal cues play a crucial role in conveying emotions, intentions, and meaning beyond words alone. Non-linguistic information, such as facial expressions, eye contact, voice tone, and pitch, is fundamental to effective interaction, enriching conversations by adding emotional and contextual depth. Recognizing the importance of non-linguistic content in communication, we present AV-EmoDialog, a dialogue system designed to exploit verbal and non-verbal information from users' audio-visual inputs to generate more responsive and empathetic interactions. AV-EmoDialog systematically exploits the emotional cues in audio-visual dialogues: it extracts speech content and emotional tone from speech, analyzes fine-grained facial expressions from visuals, and integrates these cues to generate emotionally aware responses in an end-to-end manner. Through extensive experiments, we validate that the proposed AV-EmoDialog outperforms existing multimodal LLMs in generating responses that are both emotionally and contextually appropriate.
