ICANet: A Method of Short Video Emotion Recognition Driven by Multimodal Data

Xuecheng Wu, Mengmeng Tian, Lanhang Zhai

TL;DR

This work tackles emotion recognition in short videos, where single-modality methods struggle because people often mask their real emotions and similar emotions are hard to tell apart. It introduces ICANet, a three-branch architecture that processes RGB video, optical flow, and LFCC-based audio with separate feature extractors (a two-stream I3D for the visual inputs and CA-VGG16, a VGG16 with Coordinate Attention, for audio) and fuses their predictions at the decision level. The model achieves 80.77% accuracy on the IEMOCAP dataset with an optimal fusion ratio of RGB:FLOW:Audio = $4:2:4$, surpassing current CNN-based state-of-the-art methods by 15.89 percentage points. Overall, the paper shows that short-video emotion recognition benefits substantially from deliberate multimodal integration with a feature extractor tailored to each modality, offering a practical framework for multimodal HCI applications.
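To make the decision-level fusion concrete, here is a minimal sketch in Python. It assumes each branch outputs a vector of class probabilities and that the ratio 4:2:4 is applied as normalized weights (0.4, 0.2, 0.4) in a weighted average; the summary does not spell out the exact fusion rule. The function names and the example probabilities are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def fuse_decisions(branch_probs: Sequence[Sequence[float]],
                   ratio: Sequence[float] = (4, 2, 4)) -> List[float]:
    """Weighted average of per-branch class probabilities.

    Assumption: the RGB:FLOW:Audio ratio is normalized to sum to 1 and
    used as per-branch weights. This is an illustrative reading of
    "decision-level fusion", not the authors' confirmed implementation.
    """
    if len(branch_probs) != len(ratio):
        raise ValueError("need exactly one weight per branch")
    total = float(sum(ratio))
    weights = [r / total for r in ratio]

    n_classes = len(branch_probs[0])
    fused = [0.0] * n_classes
    for weight, probs in zip(weights, branch_probs):
        if len(probs) != n_classes:
            raise ValueError("all branches must score the same classes")
        for i, p in enumerate(probs):
            fused[i] += weight * p
    return fused


def predict(fused: Sequence[float]) -> int:
    """Index of the class with the highest fused score."""
    return max(range(len(fused)), key=lambda i: fused[i])


# Hypothetical per-branch probabilities for a 3-class toy problem.
rgb = [0.6, 0.3, 0.1]
flow = [0.2, 0.5, 0.3]
audio = [0.3, 0.6, 0.1]

fused = fuse_decisions([rgb, flow, audio])  # weights 0.4, 0.2, 0.4
label = predict(fused)
```

In this toy example the RGB branch alone would pick class 0, but the fused scores favor class 1 because the flow and audio branches agree on it. This is the kind of correction late fusion is meant to provide when one modality is misleading.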

Abstract

With the rapid development of artificial intelligence and short videos, emotion recognition in short videos has become one of the most important research topics in human-computer interaction. At present, most emotion recognition methods still rely on a single modality. However, in daily life, people often disguise their real emotions, so the accuracy of single-modality emotion recognition is relatively poor. Moreover, similar emotions are not easy to distinguish. Therefore, we propose a new approach, denoted ICANet, that achieves multimodal short video emotion recognition by employing three modalities (audio, video, and optical flow), compensating for the shortcomings of any single modality and thereby improving the accuracy of emotion recognition in short videos. ICANet achieves an accuracy of 80.77% on the IEMOCAP benchmark, exceeding the SOTA methods by 15.89%.

Paper Structure (16 sections, 5 equations, 5 figures, 2 tables)


Figures (5)

  • Figure 1: The overall illustration of ICANet. It consists of the Data Preprocessing Module, the Multimodal Feature Extraction Module, and the Fusion Classification Module. Specifically, the outputs of the three feature extraction networks are fused in the decision-level fusion module.
  • Figure 2: The overall network structure of Inflated Inception-V1. Here, "Rec.Fields" represents the receptive fields for specific feature tensors.
  • Figure 3: The overall illustration of the Inception submodule "Inc.". Strides of the convolution and pooling operators are 1 where not specified.
  • Figure 4: The overall network structure of CA-VGG16. Specifically, there are five convolution transformation blocks.
  • Figure 5: The ablation experiments of the specific audio branch of ICANet in terms of ACC(%) on the IEMOCAP dataset.