CMCRD: Cross-Modal Contrastive Representation Distillation for Emotion Recognition

Siyuan Kan, Huanyu Wu, Zhenyao Cui, Fan Huang, Xiaolong Xu, Dongrui Wu

TL;DR

CMCRD targets emotion recognition when only a single modality is available at test time, distilling knowledge from an eye movement (EM) teacher into an EEG student. It introduces a cross-modal contrastive representation distillation framework that uses minimum class confusion for teacher training and a mutual-information–based CMCRD loss for the student, with sampling weights derived from the teacher's prediction entropy. Evaluated on SEED, SEED-IV, and SEED-V with three backbone networks, CMCRD improves on EEG-only baselines by about $6.2\%$ on average and outperforms other distillation methods, while reducing hardware and data collection requirements at deployment. The results support the practical potential of cross-modal distillation and suggest extensions to regression tasks, semi-supervised settings, and domain adaptation.

Abstract

Emotion recognition is an important component of affective computing and human-machine interaction. Unimodal emotion recognition is convenient, but its accuracy may not be high enough; in contrast, multi-modal emotion recognition may be more accurate, but it also increases the complexity and cost of the data collection system. This paper considers cross-modal emotion recognition, i.e., using both electroencephalography (EEG) and eye movement in training, but only EEG or eye movement at test time. We propose cross-modal contrastive representation distillation (CMCRD), which uses a pre-trained eye movement classification model to assist the training of an EEG classification model, improving feature extraction from EEG signals, or vice versa. At test time, only EEG signals (or eye movement signals) are acquired, eliminating the need for multi-modal data. CMCRD not only improves emotion recognition accuracy but also makes the system simpler and more practical. Experiments using three different neural network architectures on three multi-modal emotion recognition datasets demonstrated the effectiveness of CMCRD. Compared with the EEG-only model, it improved the average classification accuracy by about 6.2%.
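
To make the student-side idea concrete, the following is a minimal sketch of an entropy-weighted cross-modal contrastive loss. It is not the paper's CMCRD loss, whose exact form is not given in this summary. It assumes a standard InfoNCE objective (a common mutual-information lower bound) as a stand-in, pairs each EEG feature with the eye movement feature from the same trial as the positive, and uses a hypothetical weight of one minus the normalized teacher entropy. The function names, the temperature value, and the weighting formula are all illustrative assumptions.

```python
import math


def entropy_weights(teacher_probs):
    """Assumed weighting: confident (low-entropy) teacher predictions get
    larger weights. raw_i = 1 - H(p_i) / log(C), rescaled to mean 1."""
    n_classes = len(teacher_probs[0])
    raw = []
    for p in teacher_probs:
        h = -sum(q * math.log(q) for q in p if q > 0)
        raw.append(max(0.0, 1.0 - h / math.log(n_classes)))
    total = sum(raw)
    if total == 0:
        # Teacher is maximally uncertain everywhere: fall back to uniform.
        return [1.0] * len(raw)
    return [w * len(raw) / total for w in raw]


def _normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def weighted_info_nce(student_feats, teacher_feats, weights, temperature=0.1):
    """InfoNCE stand-in: student feature i should match teacher feature i
    (same trial) and differ from teacher features of the other trials."""
    s = [_normalize(v) for v in student_feats]
    t = [_normalize(v) for v in teacher_feats]
    losses = []
    for i, s_i in enumerate(s):
        logits = [_dot(s_i, t_j) / temperature for t_j in t]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(weights[i] * (log_denom - logits[i]))
    return sum(losses) / len(losses)


# Toy usage: two trials, 2-D features, three emotion classes.
teacher_probs = [[0.98, 0.01, 0.01], [1 / 3, 1 / 3, 1 / 3]]
w = entropy_weights(teacher_probs)
eeg = [[1.0, 0.0], [0.0, 1.0]]          # student (EEG) features
em_aligned = [[1.0, 0.0], [0.0, 1.0]]   # teacher (EM) features, matched
em_swapped = [[0.0, 1.0], [1.0, 0.0]]   # teacher (EM) features, mismatched
loss_aligned = weighted_info_nce(eeg, em_aligned, [1.0, 1.0])
loss_swapped = weighted_info_nce(eeg, em_swapped, [1.0, 1.0])
```

In this toy case the confident first trial receives nearly all of the weight, and the loss is smaller when the EEG and eye movement features of the same trial are aligned than when they are swapped. In an actual training loop, only the student features would receive gradients, since the teacher is pre-trained.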

Paper Structure

This paper contains 19 sections, 13 equations, 4 figures, 7 tables.

Figures (4)

  • Figure 1: Comparison of unimodal, multi-modal and cross-modal emotion recognition. EM stands for eye movement.
  • Figure 2: The architecture of CMCRD. Solid arrows represent the data flow, and hollow arrows represent the gradient flow. Both an eye movement (EM) classification model and an EEG classification model are trained, but only the EEG classification model is used in testing.
  • Figure 3: Contrastive learning in CRD and CMCRD.
  • Figure 4: $t$-SNE visualization of features extracted by the DNN model from the second subject in SEED-V. Different colors represent different emotions. (a) EEG-only features in the within-subject setting; (b) EEG features from CMCRD in the within-subject setting; (c) EEG-only features in the cross-subject setting; and (d) EEG features from CMCRD in the cross-subject setting.