A Simple Attention-Based Mechanism for Bimodal Emotion Classification
Mazen Elabd, Sardar Jaf
TL;DR
The paper tackles automatic emotion classification by combining text and audio modalities using an attention-based bimodal architecture. It leverages BERT for text and Audio Spectrogram Transformer for speech, fusing their last-hidden representations through multi-head cross-attention followed by self-attention, achieving state-of-the-art results on the MELD dataset with a weighted F1 of approximately 0.697. Through extensive per-class analysis and error inspection, the work demonstrates that multimodal fusion outperforms unimodal and baseline fusion approaches, underscoring the value of cross-modal interactions. The methods offer a scalable and extensible framework for multimodal emotion recognition with potential extensions to additional modalities such as video.
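The fusion step described above (cross-attention between the two modalities, followed by self-attention) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the projection weights are random and untrained, the sequence lengths and hidden size are toy values, the helper names `multi_head_attention` and `fuse` are hypothetical, and the choice of text as query with audio as key/value is an assumption, since the summary does not state the direction.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_attention(query, key_value, num_heads, rng):
    """Multi-head scaled dot-product attention with random (untrained) projections.

    query:     (seq_q, dim) array
    key_value: (seq_kv, dim) array
    Returns the attended output (seq_q, dim) and weights (heads, seq_q, seq_kv).
    """
    seq_q, dim = query.shape
    seq_kv = key_value.shape[0]
    assert dim % num_heads == 0, "dim must be divisible by num_heads"
    head_dim = dim // num_heads

    # Hypothetical projection matrices; a trained model would learn these.
    w_q, w_k, w_v, w_o = (
        rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(4)
    )

    # Project, then split the feature axis into heads: (heads, seq, head_dim).
    q = (query @ w_q).reshape(seq_q, num_heads, head_dim).transpose(1, 0, 2)
    k = (key_value @ w_k).reshape(seq_kv, num_heads, head_dim).transpose(1, 0, 2)
    v = (key_value @ w_v).reshape(seq_kv, num_heads, head_dim).transpose(1, 0, 2)

    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)
    weights = softmax(scores, axis=-1)
    context = weights @ v

    # Merge heads back into one feature axis and apply the output projection.
    merged = context.transpose(1, 0, 2).reshape(seq_q, dim)
    return merged @ w_o, weights


def fuse(text_hidden, audio_hidden, num_heads=4, seed=0):
    """Cross-attention (text attends to audio), then self-attention on the result."""
    rng = np.random.default_rng(seed)
    cross, cross_w = multi_head_attention(text_hidden, audio_hidden, num_heads, rng)
    fused, self_w = multi_head_attention(cross, cross, num_heads, rng)
    return fused, cross_w, self_w


# Stand-ins for the encoders' last hidden states (toy sizes, random values).
data_rng = np.random.default_rng(42)
text_hidden = data_rng.standard_normal((12, 64))   # 12 text tokens
audio_hidden = data_rng.standard_normal((30, 64))  # 30 audio patches

fused, cross_w, self_w = fuse(text_hidden, audio_hidden)
print(fused.shape, cross_w.shape, self_w.shape)
```

In this sketch the fused sequence keeps the length of the query modality, so a classifier head could pool over it (for example, by averaging) before predicting an emotion label; that pooling step is likewise an assumption and not taken from the paper.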
Abstract
Big data contain rich information for machine learning algorithms to utilize when learning important features during classification tasks. Human beings express their emotions using certain words, speech (tone, pitch, speed) or facial expressions. Artificial intelligence approaches to emotion classification are largely based on learning from textual information. However, public datasets containing text and speech data provide sufficient resources to train machine learning algorithms for the task of emotion classification. In this paper, we present novel bimodal deep learning-based architectures, enhanced with an attention mechanism, that are trained and tested on text and speech data for emotion classification. We report details of different deep learning-based architectures and show the performance of each architecture, including rigorous error analyses. Our findings suggest that deep learning-based architectures trained on different types of data (text and speech) outperform architectures trained only on text or only on speech. Our proposed attention-based bimodal architecture outperforms several state-of-the-art systems in emotion classification.
