ICANet: A Method of Short Video Emotion Recognition Driven by Multimodal Data
Xuecheng Wu, Mengmeng Tian, Lanhang Zhai
TL;DR
This work tackles emotion recognition in short videos, where single-modality signals struggle because people disguise their emotions and similar emotions overlap. It introduces ICANet, a three-branch architecture that processes RGB video, optical flow, and LFCC-based audio with separate feature extractors (a two-stream I3D for the visual branches and CA-VGG16, a VGG16 with Coordinate Attention, for audio) and fuses their predictions at the decision level. The model achieves 80.77% accuracy on the IEMOCAP dataset with an optimal fusion ratio of RGB:FLOW:Audio = $4:2:4$, surpassing current CNN-based state-of-the-art methods by 15.89 percentage points. Overall, the paper demonstrates that robust short-video emotion recognition benefits substantially from deliberate multimodal integration and feature extractors tailored to each modality, offering a practical framework for multimodal HCI applications.
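To make the decision-level fusion concrete, the following minimal sketch combines per-class probabilities from the three branches using the reported 4:2:4 ratio. It assumes the fusion is a weighted average of each branch's softmax output, which the summary above does not spell out; the function names, the label set, and the probability values are hypothetical and chosen only for illustration.

```python
def fuse_predictions(rgb_probs, flow_probs, audio_probs, weights=(4, 2, 4)):
    """Weighted average of per-class probabilities from three branches.

    Assumption: each input is a list of class probabilities that sums to 1,
    and the RGB:FLOW:Audio ratio is applied as normalized averaging weights.
    """
    if not (len(rgb_probs) == len(flow_probs) == len(audio_probs)):
        raise ValueError("all branches must predict the same number of classes")
    total = sum(weights)
    w_rgb, w_flow, w_audio = (w / total for w in weights)
    return [
        w_rgb * r + w_flow * f + w_audio * a
        for r, f, a in zip(rgb_probs, flow_probs, audio_probs)
    ]


def predict_label(fused_probs, labels):
    """Return the label whose fused probability is largest."""
    best = max(range(len(fused_probs)), key=fused_probs.__getitem__)
    return labels[best]


# Hypothetical label set and branch outputs, for illustration only.
labels = ["angry", "happy", "neutral", "sad"]
rgb = [0.1, 0.6, 0.2, 0.1]
flow = [0.3, 0.3, 0.3, 0.1]
audio = [0.5, 0.2, 0.2, 0.1]

fused = fuse_predictions(rgb, flow, audio)
print(predict_label(fused, labels))  # prints "happy"
```

In this example the audio branch alone would favor "angry", but the fused score for "happy" (0.38) exceeds that for "angry" (0.30), which shows how the weighting lets the visual branches override a single misleading modality.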
Abstract
With the rapid development of artificial intelligence and short videos, emotion recognition in short videos has become one of the most important research topics in human-computer interaction. At present, most emotion recognition methods still rely on a single modality. However, in daily life people often disguise their real emotions, so single-modality emotion recognition tends to have low accuracy. Moreover, similar emotions are hard to distinguish. Therefore, we propose a new approach, denoted ICANet, that performs multimodal short video emotion recognition using three modalities (audio, video, and optical flow), compensating for the limitations of any single modality and thereby improving the accuracy of emotion recognition in short videos. ICANet achieves an accuracy of 80.77% on the IEMOCAP benchmark, exceeding the SOTA methods by 15.89 percentage points.
