ECG-EmotionNet: Nested Mixture of Expert (NMoE) Adaptation of ECG-Foundation Model for Driver Emotion Recognition

Nastaran Mansourian, Arash Mohammadi, M. Omair Ahmad, M. N. S. Swamy

TL;DR

ECG-EmotionNet addresses driver emotion recognition under dynamic driving by adapting a pre-trained ECG foundation model through a nested Mixture of Experts that fuses embeddings from all transformer layers while freezing the backbone. This yields richer global and local feature representations using single-channel ECG with significantly fewer trainable parameters. On the manD 1.0 benchmark, it achieves an average accuracy of 82.45% and an F1 score of 77.11% across five emotions, outperforming static-environment baselines with a fraction of the training cost. The approach offers practical benefits for ADAS and HAT in autonomous driving, with robustness to noise and efficient computation, and points to future multimodal integration and real-time deployment.

Abstract

Driver emotion recognition plays a crucial role in driver monitoring systems, enhancing human-autonomy interactions and the trustworthiness of Autonomous Driving (AD). Various physiological and behavioural modalities have been explored for this purpose, with Electrocardiogram (ECG) emerging as a standout choice for real-time emotion monitoring, particularly in dynamic and unpredictable driving conditions. Existing methods, however, often rely on multi-channel ECG signals recorded under static conditions, limiting their applicability in real-world dynamic driving scenarios. To address this limitation, we introduce ECG-EmotionNet, a novel architecture designed specifically for emotion recognition in dynamic driving environments. ECG-EmotionNet is constructed by adapting a recently introduced ECG Foundation Model (FM) and employs single-channel ECG signals, targeting both robust generalizability and computational efficiency. In place of conventional adaptation methods such as full fine-tuning, linear probing, or low-rank adaptation, we propose an intuitive alternative, referred to as the nested Mixture of Experts (MoE) adaptation. More precisely, each transformer layer of the underlying FM is treated as a separate expert, and the embeddings extracted from these experts are fused using trainable weights within a gating mechanism. This approach enhances the representation of both global and local ECG features, leading to a 6% improvement in accuracy and a 7% increase in the F1 score, all while maintaining computational efficiency. The effectiveness of the proposed ECG-EmotionNet architecture is evaluated using a recently introduced and challenging driver emotion monitoring dataset.
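
The layer-wise fusion described in the abstract can be written compactly. The expression below is a sketch rather than the paper's exact equation: $h_l$ (the embedding produced by transformer layer $l$) and $L$ (the number of transformer layers) are notation introduced here for illustration, while $\alpha_l$ and $h_{agg}$ follow the symbols used in the Figure 1 caption.

$$
h_{agg} = \sum_{l=1}^{L} \alpha_l \, h_l
$$

Whether the weights $\alpha_l$ are normalized (for example, through a softmax so that they sum to one) is not stated in this summary, so no normalization is assumed here.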

Paper Structure

This paper contains 5 sections, 2 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: A graphical representation of the proposed ECG-EmotionNet methodology for emotion recognition in dynamic driving scenarios. (a) The method begins by preprocessing raw ECG signals and extracting their representations with the pretrained ECG-FM model, followed by an adaptive fusion layer and downstream emotion classification. (b) The pretrained ECG-FM model, comprising a CNN-based feature extractor and a 12-layer transformer encoder, processes the signals to generate an embedding from each transformer layer. (c) The Adaptive Expert Fusion Layer aggregates the embeddings from all transformer layers using trainable weights ($\alpha_l$). Hooks registered on each transformer encoder layer capture the intermediate outputs during the forward pass, enabling the integration of global and local features into a unified representation ($h_{agg}$). (d) The aggregated embeddings are pooled, passed through a hidden layer, and classified into five emotional categories by a fully connected layer.
  • Figure 2: Illustration of the overlapping window technique for data augmentation, where each window $W_i$ captures a subset of the ECG signal with overlapping regions to generate augmented samples.
  • Figure 3: Comparison of (a) accuracy and (b) F1-score under different additional noise levels for four strategies: NMoE fine-tuning, CNN fine-tuning, encoder fine-tuning, and full fine-tuning.
  • Figure 4: Comparison of accuracy and F1 score for models using the embeddings from all transformer layers vs. only the last transformer layer's embeddings.
  • Figure 5: Final learned weights $\alpha_i$ of the encoder layers after training, illustrating the relative importance of each layer in the model.
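
The overlapping window augmentation named in the Figure 2 caption can be illustrated with a short sketch. It is not the authors' implementation: the function name `overlapping_windows`, the window length, and the step size are assumptions chosen for illustration, since this summary does not give the values used in the paper.

```python
from typing import List, Sequence


def overlapping_windows(
    signal: Sequence[float], window_size: int, step: int
) -> List[List[float]]:
    """Split a 1-D signal into fixed-length windows whose starts are `step` samples apart.

    When step < window_size, consecutive windows share window_size - step samples.
    Trailing samples that do not fill a complete window are dropped.
    """
    if window_size <= 0 or step <= 0:
        raise ValueError("window_size and step must be positive")
    last_start = len(signal) - window_size
    return [list(signal[s:s + window_size]) for s in range(0, last_start + 1, step)]


# Toy stand-in for a single-channel ECG recording (hypothetical values).
ecg = list(range(10))

# Windows of 4 samples with a step of 2, so neighbouring windows overlap by 2 samples.
windows = overlapping_windows(ecg, window_size=4, step=2)
```

A smaller step yields more (and more strongly correlated) training samples from the same recording, which is the trade-off such an augmentation scheme has to balance.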