Multimodal Emotion Recognition and Sentiment Analysis in Multi-Party Conversation Contexts
Aref Farhadipour, Hossein Ranjbar, Masoumeh Chapariniya, Teodora Vukovic, Sarah Ebling, Volker Dellwo
TL;DR
The paper tackles emotion recognition and sentiment analysis in multi-party conversations by proposing a four-modality fusion framework that combines text, audio, facial cues, and video context. It leverages RoBERTa for text, Wav2Vec2 for speech, FacialNet for facial representations, and a CNN+Transformer-based video model, concatenating their embeddings into a multimodal vector for classification. On the MELD dataset, the four-modality fusion achieves 66.36% emotion accuracy and 72.15% sentiment accuracy, outperforming unimodal baselines. The work demonstrates the value of integrated linguistic, acoustic, and visual cues in realistic, noisy conversational settings and points to future directions in speaker-focused segmentation and arousal/valence-based fusion strategies.
Abstract
Emotion recognition and sentiment analysis are pivotal tasks in speech and language processing, particularly in real-world scenarios involving multi-party conversational data. This paper presents a multimodal approach to these challenges on the MELD dataset. We propose a system that integrates four modalities: pre-trained RoBERTa for text, pre-trained Wav2Vec2 for speech, a proposed FacialNet for facial expressions, and a CNN+Transformer architecture trained from scratch for video analysis. Feature embeddings from each modality are concatenated to form a multimodal vector, which is then used to predict emotion and sentiment labels. The multimodal system outperforms unimodal approaches, achieving an accuracy of 66.36% for emotion recognition and 72.15% for sentiment analysis.
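The concatenation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the embedding sizes in `DIMS`, the single linear layer per task, the randomly initialised weights, and all function names are assumptions made for illustration, and random vectors stand in for the outputs of the four encoders. Only the class counts (seven emotions and three sentiments in MELD) are taken as given.

```python
import math
import random

# Hypothetical embedding sizes; the source does not state them.
DIMS = {"text": 8, "audio": 8, "face": 4, "video": 4}
NUM_EMOTIONS = 7    # MELD emotion classes
NUM_SENTIMENTS = 3  # MELD sentiment classes


def fuse(embeddings):
    """Concatenate per-modality vectors in a fixed order."""
    fused = []
    for name in DIMS:
        vec = embeddings[name]
        if len(vec) != DIMS[name]:
            raise ValueError(f"{name}: expected {DIMS[name]} values, got {len(vec)}")
        fused.extend(vec)
    return fused


def softmax(logits):
    """Convert logits to probabilities, shifting by the max for stability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear_head(in_dim, out_dim, rng):
    """Random weights standing in for a trained classifier (assumption)."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias


def predict(head, fused):
    """Apply one linear layer followed by softmax to the fused vector."""
    weights, bias = head
    logits = [sum(w * x for w, x in zip(row, fused)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)


rng = random.Random(0)
fused_dim = sum(DIMS.values())
emotion_head = linear_head(fused_dim, NUM_EMOTIONS, rng)
sentiment_head = linear_head(fused_dim, NUM_SENTIMENTS, rng)

# Random stand-ins for one utterance's four encoder outputs.
utterance = {name: [rng.gauss(0.0, 1.0) for _ in range(dim)]
             for name, dim in DIMS.items()}
fused = fuse(utterance)
emotion_probs = predict(emotion_head, fused)
sentiment_probs = predict(sentiment_head, fused)
```

In this sketch both task heads read the same fused vector, so adding or removing a modality only changes `DIMS` and the input width of each head. Whether the paper shares one fused vector across both tasks in this way is not stated in the abstract.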
