Empathetic Response in Audio-Visual Conversations Using Emotion Preference Optimization and MambaCompressor

Yeonju Kim, Se Jin Park, Yong Man Ro

TL;DR

This work tackles empathetic response generation in audio-visual conversations with long histories. It introduces Emotional Preference Optimization (EPO), which trains on both correct and counter-emotional responses to sharpen the model's sensitivity to emotional nuance, and MambaCompressor, which compresses lengthy dialogue histories to reduce the computational burden. A dedicated Audio-Visual Emotion Extractor guides emotion-aware responses, and a unified framework combines these components with an LLM backbone whose base weights stay frozen while it is adapted via LoRA. Across audio-visual dialogue datasets, the approach yields stronger semantic and emotional alignment along with efficiency gains, which suggests it is practical for scalable, empathetic dialogue in real-world settings.

Abstract

Chatbot research is advancing with the growing importance of chatbots in fields that require human interaction, such as customer support and mental health care. Despite these advancements, chatbots still face significant challenges in understanding subtle nuances and managing long conversation histories. To address these issues, our study introduces a dual approach. First, we employ Emotional Preference Optimization (EPO) to train chatbots not only with correct responses but also with counter-emotional responses, that is, responses that are contextually similar but emotionally divergent. This training enables the model to discern fine-grained distinctions between correct and counter-emotional responses, thereby enhancing the quality of its responses. Second, we introduce MambaCompressor to effectively compress and manage extensive conversation histories, significantly reducing time and memory complexity while improving the chatbot's contextual understanding. Our comprehensive experiments across multiple datasets demonstrate that our model significantly outperforms existing models in generating empathetic responses and efficiently managing lengthy dialogues.
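
The abstract does not state the EPO objective itself. As a rough illustration of training on a correct response paired with a counter-emotional one, the sketch below assumes a DPO-style pairwise loss. The function name, the argument names, and the default `beta` are hypothetical and are not taken from the paper.

```python
import math


def dpo_style_loss(logp_chosen, logp_rejected,
                   ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Pairwise preference loss for one (correct, counter-emotional) pair.

    ASSUMPTION: a DPO-style objective; the paper's exact EPO loss may differ.
    Each argument is the summed log-probability of a full response under the
    trainable policy (logp_*) or a fixed reference model (ref_logp_*).
    """
    # How much more the policy prefers the correct response than the
    # reference model does, scaled by beta.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

Under this assumption, the loss equals log 2 when the policy and the reference agree, and it shrinks as the policy assigns relatively more probability to the correct response than to the counter-emotional one.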

Paper Structure

This paper contains 30 sections, 2 equations, 11 figures, 8 tables.

Figures (11)

  • Figure 1: Overview of the proposed method. The model comprises an LLM backbone, an audio-visual emotion extractor, and MambaCompressor. The audio-visual emotion extractor extracts emotional information from the audio and video, while MambaCompressor summarizes the conversation history. These processed inputs, along with text inputs, are then fed into the LLM, which generates the response.
  • Figure 2: Audio-Visual Emotion Extractor. The extractor is trained to extract emotion-related features from audio and video, with the goal of predicting the emotion category. We keep the LLM frozen and train the Q-former together with learnable queries.
  • Figure 3: Illustration of the training process for the MambaCompressor. We trained the MambaCompressor with a frozen LLM on a conversation reconstruction task.
  • Figure 4: Counter-Emotional Response Generation. We categorize different emotional situations using brackets, and then generate responses based on these emotions. Subsequently, we calculate the emotional similarity between the ground truth and the generated response to select the counter-emotional response.
  • Figure 5: Human Evaluation. Results of the human evaluation comparing different models on two metrics: Semantics Scores and Empathy Scores. Higher scores indicate better performance.
  • ...and 6 more figures
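
The selection step described in the Figure 4 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each response has already been mapped to an emotion score vector by some classifier, that similarity is cosine similarity, and that the counter-emotional response is the candidate least similar to the ground truth. The function names and the vector representation are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length emotion score vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as sharing no emotion
    return dot / (norm_a * norm_b)


def select_counter_emotional(gt_emotion, candidates):
    """Return the candidate text whose emotion is least similar to the ground truth.

    ASSUMPTION: "counter-emotional" means lowest emotional similarity.
    candidates is a list of (response_text, emotion_vector) pairs, one per
    response generated under a different bracketed emotion label.
    """
    if not candidates:
        raise ValueError("at least one candidate response is required")
    text, _ = min(candidates,
                  key=lambda c: cosine_similarity(gt_emotion, c[1]))
    return text
```

For example, with a ground-truth vector dominated by one emotion, a candidate whose scores concentrate on a different emotion would be selected over a candidate that closely matches the ground truth.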