All rivers run into the sea: Unified Modality Brain-like Emotional Central Mechanism

Xinji Mai, Junxiong Lin, Haoran Wang, Zeng Tao, Yan Wang, Shaoqi Yan, Xuan Tong, Jiawen Yu, Boyang Wang, Ziheng Zhou, Qing Zhao, Shuyong Gao, Wenqiang Zhang

TL;DR

UBA: The paper addresses robust multimodal emotion recognition under modality missingness by introducing UMBEnet, a brain-inspired network that unifies multiple modalities through a Dual-Stream structure and a Sparse Feature Fusion module. It leverages a trainable Prompt Pool and inherent prompts to simulate neural activation and cross-modal integration, guided by a brain-like emotional processing framework. The approach achieves state-of-the-art results on large DFER benchmarks (DFEW, FERV39k, MAFW), particularly excelling when modalities are missing or in multimodal contexts, and it validates its efficacy through comprehensive ablations and a two-stage training strategy. This work advances robust, interpretable multimodal emotion recognition with practical implications for real-world affective computing where data can be incomplete or variable in modality availability.

Abstract

In the field of affective computing, fully leveraging information from a variety of sensory modalities is essential for the comprehensive understanding and processing of human emotions. Inspired by the process through which the human brain handles emotions and the theory of cross-modal plasticity, we propose UMBEnet, a brain-like unified modal affective processing network. The primary design of UMBEnet includes a Dual-Stream (DS) structure that fuses inherent prompts with a Prompt Pool and a Sparse Feature Fusion (SFF) module. The Prompt Pool is designed to integrate information from different modalities, while the inherent prompts are intended to enhance the system's predictive guidance capabilities and effectively manage knowledge related to emotion classification. Moreover, considering the sparsity of effective information across different modalities, the SFF module aims to make full use of all available sensory data through the sparse integration of modality fusion prompts and inherent prompts, maintaining high adaptability and sensitivity to complex emotional states. Extensive experiments on the largest benchmark datasets in the Dynamic Facial Expression Recognition (DFER) field, including DFEW, FERV39k, and MAFW, show that UMBEnet consistently outperforms the current state-of-the-art methods. Notably, in scenarios of Modality Missingness and multimodal contexts, UMBEnet significantly surpasses the leading current methods, demonstrating outstanding performance and adaptability in tasks that involve complex emotional understanding with rich multimodal information.
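As a rough illustration of the Prompt Pool idea described in the abstract, the minimal sketch below shows how a query built from whichever modalities are present could activate a few prompts from a pool of keys. Everything in it is an assumption made for illustration and not the authors' implementation: the names `fuse_available`, `activate_prompts`, and `POOL_KEYS`, the use of mean pooling as the mapping function, cosine similarity as the matching score, top-k selection, and the toy 3-dimensional features.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def fuse_available(features):
    """Average the features of the modalities that are present (None = missing).

    Mean pooling is a hypothetical stand-in for the paper's mapping function.
    """
    present = [f for f in features.values() if f is not None]
    if not present:
        raise ValueError("at least one modality is required")
    dim = len(present[0])
    return [sum(f[d] for f in present) / len(present) for d in range(dim)]


def activate_prompts(query, keys, top_k):
    """Return the indices of the top_k pool keys most similar to the query."""
    scores = [cosine(query, k) for k in keys]
    ranked = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    return ranked[:top_k]


# Toy pool of four prompt keys (hypothetical values).
POOL_KEYS = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.7, 0.7, 0.0],
]

video = [1.0, 0.2, 0.0]
audio = [0.6, 0.9, 0.0]

# Both modalities present versus audio missing.
active_full = activate_prompts(
    fuse_available({"video": video, "audio": audio}), POOL_KEYS, top_k=2
)
active_video = activate_prompts(
    fuse_available({"video": video, "audio": None}), POOL_KEYS, top_k=2
)
print(active_full, active_video)
```

In this toy setup the same two prompts are activated in both cases, but their ranking changes: with both modalities the mixed key (index 3) ranks first, whereas with audio missing the video-aligned key (index 0) does. The selection step itself does not need to know which modality is absent.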

Paper Structure (16 sections, 12 equations, 7 figures, 4 tables)


Figures (7)

  • Figure 1: Overall architecture of UMBEnet. Panel (a) shows a brain-like emotional processing framework (BEPF). The left half of the diagram represents an unbalanced encoder, with the majority of parameters dedicated to visual encoding, while the right half shows the activated prompts. After multimodal information is encoded by the unbalanced encoder, it activates multimodal prompts in the Prompt Pool via a mapping function. These prompts, together with inherent prompts, undergo sparse feature fusion, and their similarity with the multimodal information is calculated. Panel (b) illustrates the structure of the Sparse Feature Fusion (SFF) module, including how multimodal prompts are merged with inherent prompts. Panel (c) presents the architecture of the Dual-Stream (DS) structure, with the left side showing the actual structure and the right side providing a flattened view that makes the Prompt Pool and its activation mechanism concrete.
  • Figure 2: Demonstration of the Prompt Pool's functionality in processing unimodal and multimodal information.
  • Figure 3: The operation of the activation mechanism modeled after neural impulse transmission is depicted on the left. In this process, neurotransmitters, once received by receptors during transmission, convert not into the neurotransmitters themselves but into electrical signals, analogous to a specialized key-value pair system where receptors and electrical signals correlate. The top right corner illustrates the query function, representing selectable mapping functions within this framework. This neural-inspired approach provides a biomimetic method for prompt activation, reflecting the intricacy and efficiency of neural communication in UMBEnet's architecture.
  • Figure 4: The training strategy of UMBEnet unfolds in two stages: first, prompts are trained with unimodal inputs; second, the prompts activated in the first stage are aggregated and retrained to enhance integration and responsiveness.
  • Figure 5: Features before and after processing by the model. The left side displays features as they enter the model, which are scattered overall; the right side shows features just before output, which cluster more tightly. Labels 0-10 in the legend correspond to the 11 classes.
  • ...and 2 more figures
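
The Figure 1 caption describes activated prompts being sparsely fused with inherent prompts, after which their similarity with the multimodal information is calculated. The sketch below is one possible reading of that step, under stated assumptions and not the authors' SFF module: each class has one inherent prompt, "sparse" is taken to mean that only the `keep` best-matching activated prompts receive non-zero softmax weight, fusion is a residual weighted sum, and the prediction is the class whose fused prompt has the highest cosine similarity with the input feature. The names `sparse_fuse` and `classify` and all toy values are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    nu, nv = math.sqrt(dot(u, u)), math.sqrt(dot(v, v))
    return dot(u, v) / (nu * nv) if nu and nv else 0.0


def sparse_fuse(inherent, activated, keep):
    """Blend one inherent prompt with its `keep` best-matching activated prompts.

    Softmax weights are computed over the kept prompts only, so every other
    activated prompt contributes exactly zero (the assumed "sparse" behaviour).
    """
    if keep < 1:
        raise ValueError("keep must be at least 1")
    scores = [dot(inherent, p) for p in activated]
    kept = sorted(range(len(activated)), key=lambda i: scores[i], reverse=True)[:keep]
    top = max(scores[i] for i in kept)  # subtract the max for numerical stability
    exps = {i: math.exp(scores[i] - top) for i in kept}
    total = sum(exps.values())
    fused = list(inherent)  # residual connection: start from the inherent prompt
    for i in kept:
        weight = exps[i] / total
        fused = [f + weight * x for f, x in zip(fused, activated[i])]
    return fused


def classify(feature, inherent_prompts, activated, keep=1):
    """Return (predicted class index, per-class similarities)."""
    fused = [sparse_fuse(p, activated, keep) for p in inherent_prompts]
    sims = [cosine(feature, f) for f in fused]
    return max(range(len(sims)), key=lambda c: sims[c]), sims


# Toy example: two emotion classes, two activated prompts, 2-D features.
inherent_prompts = [[1.0, 0.0], [0.0, 1.0]]
activated = [[0.9, 0.1], [0.1, 0.9]]
feature = [0.8, 0.3]

label, sims = classify(feature, inherent_prompts, activated, keep=1)
print(label, sims)
```

With `keep=1`, each inherent prompt is fused only with its single closest activated prompt. Raising `keep` moves the sketch toward dense attention over all activated prompts.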