Empathetic Response in Audio-Visual Conversations Using Emotion Preference Optimization and MambaCompressor
Yeonju Kim, Se Jin Park, Yong Man Ro
TL;DR
This work tackles empathetic response generation in audio-visual conversations with long histories. It introduces Emotional Preference Optimization (EPO), which trains on both correct and counter-emotional responses to sharpen the model's sensitivity to emotional nuance, and MambaCompressor, which compresses lengthy dialogue histories to reduce computational cost. A dedicated Audio-Visual Emotion Extractor guides emotion-aware responses, and a unified framework combines these components with an LLM backbone whose weights stay frozen while LoRA adapters are fine-tuned. Across audio-visual dialogue datasets, the approach yields stronger semantic and emotional alignment along with notable efficiency gains, indicating practical value for scalable, empathetic dialogue in real-world settings.
Abstract
Chatbot research is advancing as chatbots become more important in fields that require human interaction, such as customer support and mental health care. Despite these advances, chatbots still struggle to understand subtle nuances and to manage long conversation histories. To address these issues, our study introduces a dual approach. First, we employ Emotional Preference Optimization (EPO) to train chatbots not only with correct responses but also with counter-emotional responses, that is, responses that are contextually similar but emotionally divergent. This training enables the model to discern the fine distinctions between correct and counter-emotional responses, thereby enhancing the quality of its responses. Second, we introduce MambaCompressor to compress and manage extensive conversation histories, significantly reducing time and memory complexity while improving the chatbot's contextual understanding. Our comprehensive experiments across multiple datasets demonstrate that our model significantly outperforms existing models in generating empathetic responses and in efficiently managing lengthy dialogues.
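The abstract does not state the EPO training objective. As a rough illustration only, the sketch below assumes a DPO-style pairwise preference loss in which the correct response is the preferred sample and the counter-emotional response is the rejected one. The function name `preference_loss`, its arguments, and the default `beta` are hypothetical and are not taken from the paper.

```python
import math


def preference_loss(logp_correct, logp_counter,
                    ref_logp_correct, ref_logp_counter, beta=0.1):
    """DPO-style loss for one (correct, counter-emotional) response pair.

    Assumption: EPO is treated here as a standard pairwise preference
    objective; the paper's actual loss may differ.
    Each argument is the summed log-probability of a full response under
    the trainable policy (logp_*) or a frozen reference model (ref_logp_*).
    """
    # How much more the policy favors the correct response than the
    # reference model does, scaled by beta.
    margin = beta * ((logp_correct - ref_logp_correct)
                     - (logp_counter - ref_logp_counter))
    # -log(sigmoid(margin)) computed as a numerically stable softplus(-margin).
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical log-probabilities for a single dialogue turn.
loss_good = preference_loss(-2.0, -8.0, -5.0, -5.0)  # policy favors correct
loss_bad = preference_loss(-8.0, -2.0, -5.0, -5.0)   # policy favors counter-emotional
```

Under this assumed objective, the loss falls as the policy assigns relatively more probability to the correct response than to its counter-emotional counterpart, which matches the abstract's stated goal of separating contextually similar but emotionally divergent responses.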
