VR Based Emotion Recognition Using Deep Multimodal Fusion With Biosignals Across Multiple Anatomical Domains

Pubudu L. Indrasiri, Bipasha Kashyap, Chandima Kolambahewage, Bahareh Nakisa, Kiran Ijaz, Pubudu N. Pathirana

TL;DR

This paper tackles VR-based emotion recognition by leveraging multi-domain biosignals from the head, trunk, and peripheral regions. It introduces EMO-MSASE, a multi-scale attention LSTM framework augmented with Squeeze-and-Excitation blocks that fuses modalities across these three domains for valence and arousal classification. Key contributions include systematic multi-domain data collection with synchronized devices, a domain-wise signal importance analysis, and cross-domain fusion that shows significant gains over unimodal baselines, validated with Group K-Fold and Leave-One-Subject-Out strategies. The proposed method advances affective computing in immersive environments, supporting more robust and scalable emotion-aware VR applications with practical implications for HCI and healthcare.
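
Both validation strategies named above keep every recording from a given participant on one side of the train/test split, so a model is never tested on a person it was trained on. The following is a minimal sketch of that idea in plain Python, not the authors' evaluation code: the function names, the toy participant IDs, and the round-robin fold assignment are all assumptions made for illustration (library implementations of group k-fold typically also balance fold sizes).

```python
from collections import defaultdict


def leave_one_subject_out(subject_ids):
    """Yield (held_out_subject, train_idx, test_idx), one split per subject."""
    by_subject = defaultdict(list)
    for idx, sid in enumerate(subject_ids):
        by_subject[sid].append(idx)
    for held_out in sorted(by_subject):
        test_idx = by_subject[held_out]
        train_idx = sorted(
            i for sid, idxs in by_subject.items() if sid != held_out for i in idxs
        )
        yield held_out, train_idx, test_idx


def group_k_fold(subject_ids, k):
    """Yield (train_idx, test_idx) with whole subjects assigned to k folds.

    Simplified assumption: subjects are dealt to folds round-robin,
    without balancing the number of samples per fold.
    """
    subjects = sorted(set(subject_ids))
    if k < 2 or k > len(subjects):
        raise ValueError("k must be between 2 and the number of subjects")
    for fold in range(k):
        test_subjects = set(subjects[fold::k])
        test_idx = [i for i, s in enumerate(subject_ids) if s in test_subjects]
        train_idx = [i for i, s in enumerate(subject_ids) if s not in test_subjects]
        yield train_idx, test_idx


# Hypothetical toy data: one participant ID per recorded sample.
sample_subjects = ["P01", "P01", "P02", "P02", "P02", "P03"]
loso_splits = list(leave_one_subject_out(sample_subjects))
gkf_splits = list(group_k_fold(sample_subjects, k=2))
```

With 23 participants, the leave-one-subject-out variant would produce 23 splits, each testing on a single unseen participant.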

Abstract

Emotion recognition is significantly enhanced by integrating multimodal biosignals and IMU data from multiple domains. In this paper, we introduce a novel multi-scale attention-based LSTM architecture, combined with Squeeze-and-Excitation (SE) blocks, that leverages multi-domain signals from the head (Meta Quest Pro VR headset), trunk (Equivital Vest), and periphery (Empatica EmbracePlus) during affect elicitation via visual stimuli. Signals from 23 participants were recorded, alongside self-assessed valence and arousal ratings after each stimulus. LSTM layers extract features from each modality, while multi-scale attention captures fine-grained temporal dependencies, and SE blocks recalibrate feature importance prior to classification. We assess which domain's signals carry the most distinctive emotional information during VR experiences, identifying key biosignals contributing to emotion detection. The proposed architecture, validated in a user study, demonstrates superior performance in classifying valence and arousal levels (high / low), showcasing the efficacy of multi-domain, multimodal fusion of biosignals (e.g., TEMP, EDA) with IMU data (e.g., accelerometer) for emotion recognition in real-world applications.
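
To make the "recalibrate feature importance" step concrete, here is a minimal NumPy sketch of a generic Squeeze-and-Excitation block: average each feature channel over time (squeeze), pass the result through a small two-layer bottleneck ending in a sigmoid (excitation), and rescale each channel by its gate. The abstract does not give the paper's layer sizes or where exactly the block sits, so the shapes, the reduction size, and the random weights below are assumptions for illustration only.

```python
import numpy as np


def squeeze_excite(features, w1, w2):
    """Apply a generic SE block to features of shape (channels, time_steps).

    w1: (reduced, channels) bottleneck weights, followed by ReLU.
    w2: (channels, reduced) expansion weights, followed by sigmoid.
    Returns the recalibrated features and the per-channel gates.
    """
    squeezed = features.mean(axis=1)               # squeeze: one value per channel
    hidden = np.maximum(0.0, w1 @ squeezed)        # bottleneck + ReLU
    gates = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))   # sigmoid gates, one per channel
    return features * gates[:, None], gates        # rescale every channel


# Hypothetical sizes: 6 fused feature channels, 20 time steps, bottleneck of 2.
rng = np.random.default_rng(0)
channels, steps, reduced = 6, 20, 2
features = rng.normal(size=(channels, steps))
w1 = rng.normal(size=(reduced, channels))
w2 = rng.normal(size=(channels, reduced))

recalibrated, gates = squeeze_excite(features, w1, w2)
```

In a trained model the two weight matrices are learned, so channels that help separate high from low valence or arousal receive gates near 1 while less informative channels are attenuated.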

Paper Structure

This paper contains 27 sections, 12 equations, 5 figures, 8 tables.

Figures (5)

  • Figure 1: Data collection arrangement for a single participant.
  • Figure 2: End-to-end data collection setup. (a) A participant wearing the Equivital Vest (under his shirt), the Empatica EmbracePlus, and the Meta Quest Pro VR headset, with a controller in each hand. Gel electrodes for measuring GSR activity are attached to the middle and ring fingers of the left hand. The participant is asked to move around comfortably and interact with the VR content while standing. (b) The high-level data flow diagram across the three domains. (c) The bespoke VR application architecture.
  • Figure 3: Proposed multi-domain, multimodal deep learning architecture for emotion (valence, arousal) classification using multi-scale attention, LSTM models, and a Squeeze-and-Excitation block. Only the best-performing modalities from each domain are outlined.
  • Figure 4: Proposed multi-scale attention (MSA) based LSTM feature extractor for a single modality.
  • Figure 5: Comparison of unimodal valence and arousal accuracy across four cases (general, majority, males only, and G2) for the three domains. Blue indicates the trunk domain, green the peripheral domain, and red the head domain. (a) shows the valence comparison and (b) the arousal comparison.