Feature-Based Dual Visual Feature Extraction Model for Compound Multimodal Emotion Recognition

Ran Liu, Fengyu Zhang, Cong Yu, Longjiang Yang, Zhuofan Wen, Siyuan Zhang, Hailiang Yao, Shun Chen, Zheng Lian, Bin Liu

TL;DR

This work tackles compound emotion recognition in unconstrained environments by proposing a multimodal framework that fuses visual representations from Vision Transformer (ViT) and ResNet50, along with audio and text cues. It introduces a feature-based pipeline with a Temporal Convolutional Network and a co-attention fusion mechanism to generate frame-level embeddings that are aggregated for video-level predictions. Evaluations on C-EXPR-DB and MELD demonstrate that the ViT-ResNet visual fusion, when integrated with temporal and cross-modal modeling, yields robust performance in scenarios with complex cues. The approach offers a practical, end-to-end solution for reliable multimodal compound emotion recognition in real-world settings, with open-source code available at the linked GitHub repository.
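The visual-fusion and aggregation steps summarized above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: channel-wise concatenation as the fusion rule, mean pooling as the frame-to-video aggregation, the 768-dimensional ViT width, the 7-class output, and the random linear head standing in for the Temporal Convolutional Network and co-attention stages are all assumptions made for illustration.

```python
import numpy as np

def fuse_visual(vit_feats, resnet_feats):
    """Concatenate per-frame ViT and ResNet50 features along the channel axis.

    Concatenation is an assumed fusion rule, not one stated in the source.
    """
    if vit_feats.shape[0] != resnet_feats.shape[0]:
        raise ValueError("both streams must cover the same frames")
    return np.concatenate([vit_feats, resnet_feats], axis=1)

def video_level_prediction(frame_logits):
    """Average frame-level logits over time, then return the top class index.

    Mean pooling is an assumed aggregation rule.
    """
    mean_logits = frame_logits.mean(axis=0)
    return int(np.argmax(mean_logits))

rng = np.random.default_rng(0)
num_frames = 16
vit_feats = rng.normal(size=(num_frames, 768))      # assumed ViT embedding width
resnet_feats = rng.normal(size=(num_frames, 2048))  # ResNet50 pooled feature width
fused = fuse_visual(vit_feats, resnet_feats)        # shape (16, 2816)

# Hypothetical linear head standing in for the TCN and co-attention stages.
num_classes = 7  # assumed number of compound-expression classes
head = rng.normal(size=(fused.shape[1], num_classes)) * 0.01
frame_logits = fused @ head                         # one logit row per frame
pred = video_level_prediction(frame_logits)
```

In a full pipeline, the audio and text features would enter at the cross-modal fusion stage; they are omitted here to keep the sketch focused on the two visual streams.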

Abstract

This article presents our results for the eighth Affective Behavior Analysis in-the-wild (ABAW) competition. Multimodal emotion recognition (ER) has important applications in affective computing and human-computer interaction. In real-world settings, however, compound emotion recognition faces greater uncertainty and conflicts between modalities. For the Compound Expression (CE) Recognition Challenge, this paper proposes a multimodal emotion recognition method that fuses features from a Vision Transformer (ViT) and a Residual Network (ResNet). We conducted experiments on the C-EXPR-DB and MELD datasets. The results show that in scenarios with complex visual and audio cues (such as C-EXPR-DB), the model that fuses ViT and ResNet features exhibits superior performance. Our code is available at https://github.com/MyGitHub-ax/8th_ABAW

Paper Structure

This paper contains 9 sections, 2 equations, 1 figure, 2 tables.

Figures (1)

  • Figure 1: Feature-based Modeling.