
Milmer: a Framework for Multiple Instance Learning based Multimodal Emotion Recognition

Zaitian Wang, Jian He, Yu Liang, Xiyuan Hu, Tianhao Peng, Kaixin Wang, Jiakai Wang, Chenlong Zhang, Weili Zhang, Shuang Niu, Xiaoyang Xie

TL;DR

This work addresses multimodal emotion recognition by jointly leveraging EEG signals and facial expressions. It proposes Milmer, a transformer-based framework that fuses the two modalities through a cross-attention fusion module and applies Multiple Instance Learning (MIL) to sequences of facial frames, with a Swin Transformer backbone for visual feature extraction. Key contributions include 1) an MIL-based facial representation that selects the top-K frames, 2) cross-attention-based token balancing to harmonize EEG and visual features, 3) a four-module architecture (EEG preprocessing, facial feature extraction with balancing, cross-modal fusion, and classification), and 4) extensive ablations and state-of-the-art results on the DEAP dataset, with 96.72% accuracy in four-class emotion recognition and publicly released code. The results suggest that cross-attention fusion and temporal modeling of facial frames both contribute to emotion recognition accuracy, with potential impact in HCI and affective computing.

Abstract

Emotions play a crucial role in human behavior and decision-making, making emotion recognition a key area of interest in human-computer interaction (HCI). This study addresses the challenges of emotion recognition by integrating facial expression analysis with electroencephalogram (EEG) signals, introducing a novel multimodal framework, Milmer. The proposed framework employs a transformer-based fusion approach to integrate visual and physiological modalities. It consists of an EEG preprocessing module, a facial feature extraction and balancing module, and a cross-modal fusion module. To enhance visual feature extraction, we fine-tune a pre-trained Swin Transformer on emotion-related datasets. Additionally, a cross-attention mechanism is introduced to balance token representation across modalities, so that neither modality dominates the fused representation. A key innovation of this work is the adoption of a multiple instance learning (MIL) approach, which extracts meaningful information from multiple facial expression images over time, capturing critical temporal dynamics often overlooked in previous studies. Extensive experiments conducted on the DEAP dataset demonstrate the superiority of the proposed framework, achieving a classification accuracy of 96.72% in the four-class emotion recognition task. Ablation studies further validate the contributions of each module, highlighting the significance of advanced feature extraction and fusion strategies in enhancing emotion recognition performance. Our code is available at https://github.com/liangyubuaa/Milmer.
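
To make the MIL idea concrete, the following is a minimal sketch of top-K instance pooling over a bag of facial-frame features, in the spirit of the top-K frame selection mentioned in the TL;DR. It is not the authors' implementation: the function name `top_k_mil_pool`, the linear instance scorer `scorer_weights`, the mean aggregation, and all array sizes are assumptions chosen for illustration.

```python
import numpy as np


def top_k_mil_pool(frame_features, scorer_weights, k):
    """Pool a bag of frame features by keeping the k highest-scoring frames.

    frame_features: (T, D) array, one row per facial frame (the MIL instances).
    scorer_weights: (D,) array, a hypothetical linear instance scorer.
    Returns (pooled, kept): the (D,) mean of the kept frames and their indices.
    """
    scores = frame_features @ scorer_weights      # one score per frame, shape (T,)
    k = min(k, len(scores))                       # guard against bags shorter than k
    kept = np.argsort(scores)[::-1][:k]           # indices of the k largest scores
    kept = np.sort(kept)                          # restore temporal order
    pooled = frame_features[kept].mean(axis=0)    # aggregate the selected instances
    return pooled, kept


# Illustrative bag: 10 frames with 8-dimensional features (sizes are arbitrary).
rng = np.random.default_rng(0)
bag = rng.normal(size=(10, 8))
weights = rng.normal(size=8)
pooled, kept = top_k_mil_pool(bag, weights, k=3)
```

In a trained model the scorer would be learned jointly with the classifier; the random weights here only stand in for it.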

Paper Structure

This paper contains 21 sections, 7 equations, 3 figures, 6 tables.

Figures (3)

  • Figure 1: The overview of our framework, consisting of an EEG preprocessing module, a facial feature extraction and balancing module, and a modality fusion module.
  • Figure 2: The image feature dimensionality reduction module in this work. A cross-attention mechanism is used to reduce the number of feature vectors output by Swin Transformer to the predefined number of learnable query tokens. Here, M represents the number of feature vectors output by Swin Transformer, D is the vector dimension, and N is the number of learnable query tokens.
  • Figure 3: Confusion matrices of subject-dependent and subject-independent experiments.
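
The token reduction described in the Figure 2 caption can be sketched as single-head cross-attention in which N learnable query tokens attend over the M feature vectors of dimension D. This is an illustrative sketch rather than the paper's code: the function names, the single-head formulation, the random matrices standing in for learned parameters, and the specific values of M, D, and N are all assumptions.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def reduce_tokens(features, queries, w_q, w_k, w_v):
    """Reduce M feature vectors to N tokens with single-head cross-attention.

    features: (M, D) feature vectors (e.g. from a visual backbone).
    queries:  (N, D) query tokens (learned in a real model).
    w_q, w_k, w_v: (D, D) projection matrices (learned in a real model).
    Returns (tokens, attn): (N, D) reduced tokens and (N, M) attention weights.
    """
    q = queries @ w_q                                 # (N, D)
    k = features @ w_k                                # (M, D)
    v = features @ w_v                                # (M, D)
    logits = q @ k.T / np.sqrt(q.shape[-1])           # scaled dot products, (N, M)
    attn = softmax(logits, axis=-1)                   # each query's weights sum to 1
    return attn @ v, attn                             # weighted sums of values


# Illustrative sizes only: M = 49 input vectors, D = 16, N = 4 query tokens.
rng = np.random.default_rng(0)
M, D, N = 49, 16, 4
features = rng.normal(size=(M, D))
queries = rng.normal(size=(N, D))
w_q, w_k, w_v = (rng.normal(size=(D, D)) / np.sqrt(D) for _ in range(3))
tokens, attn = reduce_tokens(features, queries, w_q, w_k, w_v)
```

Because the output length is set by the number of queries rather than the number of inputs, the visual branch always contributes N tokens regardless of M, which is one way to keep its token count comparable to that of the EEG branch.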