Memory-guided Prototypical Co-occurrence Learning for Mixed Emotion Recognition

Ming Li, Yong-Jin Liu, Fang Liu, Huankun Sheng, Yeying Fan, Yixiang Wei, Minnan Luo, Weizhan Zhang, Wenping Wang

TL;DR

A Memory-guided Prototypical Co-occurrence Learning (MPCL) framework that explicitly models emotion co-occurrence patterns and introduces a memory retrieval strategy to extract semantic-level co-occurrence associations across emotion categories.

Abstract

Emotion recognition from multi-modal physiological and behavioral signals plays a pivotal role in affective computing, yet most existing models remain constrained to the prediction of singular emotions in controlled laboratory settings. Real-world human emotional experiences, by contrast, are often characterized by the simultaneous presence of multiple affective states, spurring recent interest in mixed emotion recognition as an emotion distribution learning problem. Current approaches, however, often neglect the valence consistency and structured correlations inherent among coexisting emotions. To address this limitation, we propose a Memory-guided Prototypical Co-occurrence Learning (MPCL) framework that explicitly models emotion co-occurrence patterns. Specifically, we first fuse multi-modal signals via a multi-scale associative memory mechanism. To capture cross-modal semantic relationships, we construct emotion-specific prototype memory banks, yielding rich physiological and behavioral representations, and employ prototype relation distillation to ensure cross-modal alignment in the latent prototype space. Furthermore, inspired by human cognitive memory systems, we introduce a memory retrieval strategy to extract semantic-level co-occurrence associations across emotion categories. Through this bottom-up hierarchical abstraction process, our model learns affectively informative representations for accurate emotion distribution prediction. Comprehensive experiments on two public datasets demonstrate that MPCL consistently outperforms state-of-the-art methods in mixed emotion recognition, both quantitatively and qualitatively.

Paper Structure

This paper contains 39 sections, 20 equations, 9 figures, and 8 tables.

Figures (9)

  • Figure 1: Overview of the MPCL framework, which consists of three stages. (a) Multimodal Feature Extraction and Fusion: the Multi‑Scale Associative Fusion (MSAF) module fuses multi-modal physiological signals, while a separate behavioral encoder extracts structured behavioral features. (b) Prototypical Alignment and Co‑occurrence Learning: emotion prototype memory banks are constructed from both modalities. The Prototype Relation Distillation (PRD) module enforces cross‑modal structural alignment, and the Prototypical Co‑occurrence Learning (PCL) module captures semantic‑level co‑occurrence patterns through memory retrieval. (c) Hierarchical Semantic Compression and Distribution Prediction: the Hierarchical Semantic Compression (HSC) module abstracts affective representations via a bottom‑up strategy, followed by a classifier that outputs the final emotion distribution.
  • Figure 2: Fusion of multimodal physiological signals. EEG serves as the primary modality to integrate complementary information from auxiliary modalities (PPG and GSR) via a multi-scale associative memory mechanism. Three distinct scaling factors $\mathcal{B} = \{\beta_{\text{low}}, \beta_{\text{mid}}, \beta_{\text{high}}\}$ are employed to modulate the granularity of information aggregation.
  • Figure 3: Prototype alignment process. Prototype memory banks are first constructed separately for the physiological and behavioral modalities. A semantics-enriched representation is then obtained as a weighted combination of prototypes from each bank. Meanwhile, the Prototype Relation Distillation (PRD) strategy enforces semantic structural consistency across the two modalities.
  • Figure 4: Architecture of Prototypical Co‑occurrence Learning (PCL). The physiological embedding $x_i$ and the behavioral embedding $y_i$ retrieve associated representations $U_{x_i}$ and $U_{y_i}$, respectively, via the Hopfield network. Within the physiological embedding space, $U_{x_i}$ (retrieved via the physiological query) serves as an anchor that is contrasted with the positive sample $U_{y_i}$ (retrieved from the corresponding behavioral query) and with negative samples $U_{y_j}$ (retrieved from mismatched behavioral queries, where $j \neq i$). The same procedure is applied symmetrically in the behavioral embedding space $V$.
  • Figure 5: Comparison of emotion distribution predictions between MPCL and state-of-the-art baselines on two representative samples (Subject 26). GT denotes the ground-truth distribution. Emotion indices 1–10 correspond to: inspired, alert, excited, enthusiastic, determined, afraid, upset, nervous, scared, and distressed.
  • ...and 4 more figures
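
The retrieval and contrast steps described in the captions of Figures 2 and 4 can be illustrated with a small sketch. It is not the authors' implementation: it assumes a softmax-based (modern Hopfield-style) retrieval in which a scaling factor beta controls how sharply a query attends to stored patterns, it assumes that the outputs at the three scales are simply averaged, and it assumes an InfoNCE-style contrastive loss with cosine similarity. The function names, the temperature value, and the toy data are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def hopfield_retrieve(query, memory, beta):
    """One softmax-weighted retrieval: weights = softmax(beta * M q)."""
    weights = softmax([beta * dot(query, row) for row in memory])
    return [sum(w * row[d] for w, row in zip(weights, memory))
            for d in range(len(memory[0]))]


def multi_scale_retrieve(query, memory, betas):
    """Retrieve once per scaling factor, then average (averaging is an assumption)."""
    outputs = [hopfield_retrieve(query, memory, b) for b in betas]
    return [sum(col) / len(outputs) for col in zip(*outputs)]


def cosine(a, b, eps=1e-12):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)) + eps)


def contrastive_loss(anchors, candidates, temperature=0.1):
    """InfoNCE-style loss: candidates[i] is the positive for anchors[i],
    and every candidates[j] with j != i acts as a negative."""
    losses = []
    for i, anchor in enumerate(anchors):
        probs = softmax([cosine(anchor, c) / temperature for c in candidates])
        losses.append(-math.log(probs[i]))
    return sum(losses) / len(losses)


def symmetric_contrastive_loss(u_x, u_y, temperature=0.1):
    """Apply the loss in both directions, mirroring the symmetric procedure."""
    return 0.5 * (contrastive_loss(u_x, u_y, temperature)
                  + contrastive_loss(u_y, u_x, temperature))
```

In this sketch, a scaling factor of zero returns the mean of all stored patterns, while a large scaling factor returns something close to the single best-matching pattern. Combining several factors therefore mixes coarse and fine-grained retrievals.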