All rivers run into the sea: Unified Modality Brain-like Emotional Central Mechanism
Xinji Mai, Junxiong Lin, Haoran Wang, Zeng Tao, Yan Wang, Shaoqi Yan, Xuan Tong, Jiawen Yu, Boyang Wang, Ziheng Zhou, Qing Zhao, Shuyong Gao, Wenqiang Zhang
TL;DR
The paper addresses robust multimodal emotion recognition when modalities are missing by introducing UMBEnet, a brain-inspired network that unifies multiple modalities through a Dual-Stream structure and a Sparse Feature Fusion module. It uses a trainable Prompt Pool and inherent prompts to simulate neural activation and cross-modal integration, guided by a brain-like emotional processing framework. The approach achieves state-of-the-art results on large Dynamic Facial Expression Recognition (DFER) benchmarks (DFEW, FERV39k, MAFW), and performs especially well when modalities are missing or in multimodal contexts. Its efficacy is validated through comprehensive ablations and a two-stage training strategy. This work advances robust, interpretable multimodal emotion recognition, with practical implications for real-world affective computing where data can be incomplete or modality availability can vary.
Abstract
In the field of affective computing, fully leveraging information from a variety of sensory modalities is essential for the comprehensive understanding and processing of human emotions. Inspired by the process through which the human brain handles emotions and by the theory of cross-modal plasticity, we propose UMBEnet, a brain-like unified modal affective processing network. The primary design of UMBEnet includes a Dual-Stream (DS) structure that fuses inherent prompts with a Prompt Pool, and a Sparse Feature Fusion (SFF) module. The Prompt Pool is designed to integrate information from different modalities, while the inherent prompts are intended to enhance the system's predictive guidance capabilities and effectively manage knowledge related to emotion classification. Moreover, considering the sparsity of effective information across different modalities, the SFF module aims to make full use of all available sensory data through the sparse integration of modality fusion prompts and inherent prompts, maintaining high adaptability and sensitivity to complex emotional states. Extensive experiments on the largest benchmark datasets in the Dynamic Facial Expression Recognition (DFER) field, including DFEW, FERV39k, and MAFW, show that UMBEnet consistently outperforms the current state-of-the-art methods. Notably, in scenarios of Modality Missingness and in multimodal contexts, UMBEnet significantly surpasses the leading current methods, demonstrating outstanding performance and adaptability in tasks that involve complex emotional understanding with rich multimodal information.
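To make the idea of sparse prompt integration under missing modalities more concrete, the following is a minimal sketch and not the authors' SFF module. It assumes a simple top-k selection: the available modality features are averaged into a query, each prompt in a pool is scored by dot product, and only the k best-scoring prompts are fused with softmax weights. The function `sparse_fuse`, its parameters, the top-k rule, and the toy vectors are all hypothetical choices made for illustration.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def dot(a, b):
    """Dot product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def sparse_fuse(modal_feats, prompt_pool, k=2):
    """Fuse the k prompts that best match the available modalities.

    Hypothetical illustration only; not the SFF module from the paper.
    modal_feats: dict mapping modality name -> feature vector,
                 or None when that modality is missing.
    prompt_pool: list of prompt vectors with the same dimension.
    Returns (fused_vector, selected_prompt_indices).
    """
    # Missing modalities (None) are simply skipped.
    available = [v for v in modal_feats.values() if v is not None]
    if not available:
        raise ValueError("at least one modality must be present")

    query = mean_vector(available)
    scores = [dot(query, p) for p in prompt_pool]

    # Sparse step: keep only the k highest-scoring prompts.
    order = sorted(range(len(prompt_pool)), key=lambda i: scores[i], reverse=True)
    top = order[:k]

    # Numerically stable softmax over the kept prompts only.
    peak = max(scores[i] for i in top)
    exps = [math.exp(scores[i] - peak) for i in top]
    total = sum(exps)
    weights = [e / total for e in exps]

    dim = len(query)
    fused = [
        sum(w * prompt_pool[i][d] for w, i in zip(weights, top))
        for d in range(dim)
    ]
    return fused, top


# Toy example: the audio modality is missing.
pool = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
feats = {"video": [2.0, 0.0], "audio": None, "text": [0.0, 0.0]}
fused, picked = sparse_fuse(feats, pool, k=2)
```

In this toy run the missing audio input is ignored, the third prompt is excluded by the top-k step, and the fused vector is a weighted mix of the first two prompts. A trained model would learn the prompts and the scoring function rather than use fixed vectors and a plain dot product.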
