Revisiting Multi-modal Emotion Learning with Broad State Space Models and Probability-guidance Fusion

Yuntao Shou; Tao Meng; Fuchen Zhang; Nan Yin; Keqin Li

TL;DR

This paper tackles Multi-modal Emotion Recognition in Conversation (MERC) by modeling long-range context at the feature disentanglement stage and enforcing inter-modal consistency at the fusion stage. It introduces Broad Mamba, a bidirectional SSM-based module augmented with a Broad Learning System to explore the data distribution in a broad space, and a probability-guided fusion mechanism that weights modalities using predicted label probabilities. On IEMOCAP and MELD, the approach reports state-of-the-art results with a compact 1.73M-parameter footprint and competitive runtimes. The authors present it as a scalable route toward MERC architectures that fuse multi-modal signals while modeling long-range dependencies.

Abstract

Multi-modal Emotion Recognition in Conversation (MERC) has received considerable attention in various fields, e.g., human-computer interaction and recommendation systems. Most existing works perform feature disentanglement and fusion to extract emotional contextual information from multi-modal features and then classify emotions. After revisiting the characteristics of MERC, we argue that long-range contextual semantic information should be extracted in the feature disentanglement stage and that inter-modal semantic consistency should be maximized in the feature fusion stage. Among recent State Space Models (SSMs), Mamba can efficiently model long-distance dependencies. In this work, we build on these insights to further improve the performance of MERC. Specifically, in the feature disentanglement stage, we propose Broad Mamba, which does not rely on a self-attention mechanism for sequence modeling but instead uses state space models to compress emotional representations and a broad learning system to explore the potential data distribution in a broad space. Unlike previous SSMs, we design a bidirectional SSM convolution to extract global context information. In the feature fusion stage, we design a probability-guided multi-modal fusion strategy to maximize the consistency of information between modalities. Experimental results show that the proposed method can overcome the computational and memory limitations of Transformers when modeling long-distance contexts, and has great potential to become a next-generation general architecture for MERC.
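To make the bidirectional SSM idea concrete, here is a minimal sketch in plain Python. It is not the paper's Broad Mamba block. It assumes a time-invariant diagonal SSM on a scalar sequence, whereas Mamba uses input-dependent (selective) parameters, and it assumes the forward and backward outputs are combined by summation. The names `ssm_scan`, `bidirectional_ssm`, and the parameter values are hypothetical.

```python
def ssm_scan(xs, a, b, c):
    """Diagonal linear SSM over a scalar sequence (illustrative only).

    State update: h_t = a * h_{t-1} + b * x_t   (elementwise)
    Readout:      y_t = sum(c * h_t)
    """
    h = [0.0] * len(a)
    ys = []
    for x in xs:
        h = [a_i * h_i + b_i * x for a_i, h_i, b_i in zip(a, h, b)]
        ys.append(sum(c_i * h_i for c_i, h_i in zip(c, h)))
    return ys


def bidirectional_ssm(xs, fwd_params, bwd_params):
    """Scan left-to-right and right-to-left, then sum the aligned outputs.

    Summation is an assumption; the paper may combine directions differently.
    """
    y_fwd = ssm_scan(xs, *fwd_params)
    # Reverse the input, scan, then reverse the output back into original order.
    y_bwd = ssm_scan(xs[::-1], *bwd_params)[::-1]
    return [f + b for f, b in zip(y_fwd, y_bwd)]


# Hypothetical parameters: (a, b, c) for a 2-dimensional state.
params = ([0.5, 0.9], [1.0, 1.0], [1.0, 0.5])
xs = [1.0, 0.0, 0.0, 2.0]
ys = bidirectional_ssm(xs, params, params)
```

With a forward-only scan, the output at the first position depends only on the first input. In the bidirectional version it also reflects later utterances, which is the sense in which such a scan captures global context. Each scan costs time linear in the sequence length.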

Paper Structure (26 sections, 20 equations, 7 figures, 5 tables)

Introduction
Related work
Multi-modal Emotion Recognition in Conversation
State Space Models
Preliminary Information
Multi-modal Feature Extraction
State Space Model
Broad Learning System
The proposed method
Feature Disentanglement
1D-Conv
Broad Mamba
Computation-Efficiency
Feature Fusion
Probability-guided Fusion Model
...and 11 more sections

Figures (7)

Figure 1: An illustrative example of multi-modal emotion recognition in conversation. For each given sentence, it contains three modal information about the speaker, i.e., text, video and audio. The task of MERC is to identify the emotional labels contained in the three modal information.
Figure 2: The overall architecture of Broad Learning System (BLS). $\mathbf{Z}_i$ represents the feature nodes, $\mathbf{H}_i$ represents the enhancement nodes, and $\mathbf{Y}$ represents the predicted labels.
Figure 3: The overall framework of the proposed model. Specifically, we first input the extracted multi-modal features into a 1-D convolutional layer for multi-scale feature extraction and introduce position encoding information to consider the position information of the series in the context. Then we input the obtained multi-modal features with multi-scale information into Broad Mamba to extract contextual semantic information and explore the potential data distribution in the broad space. Finally, we use a probability-guidance fusion model to complete the fusion of multi-modal features and achieve emotion prediction.
Figure 4: The overall architecture of Broad Mamba. We use a bidirectional SSM to encode forward and reverse contextual semantic information.
Figure 5: Emotion recognition effects of different fusion methods on the IEMOCAP and MELD datasets. The experimental results are statistically significant ($t$-test with $p < 0.05$).
...and 2 more figures
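The Broad Learning System in Figure 2 can be sketched as follows, using NumPy. This is a generic, single-group BLS under stated assumptions, not the configuration used in the paper: random fixed weights produce the feature nodes Z, a `tanh` of a second random projection produces the enhancement nodes H, and only the output weights are fitted, by ridge regression on the concatenation [Z | H]. The function names, node counts, and the ridge value are hypothetical.

```python
import numpy as np


def bls_fit(X, Y, n_feature=20, n_enhance=40, ridge=1e-3, seed=0):
    """Fit a minimal Broad Learning System (illustrative sketch).

    X: (N, d) inputs, Y: (N, k) targets (e.g. one-hot labels).
    """
    rng = np.random.default_rng(seed)
    # Random, untrained projections for feature and enhancement nodes.
    W_f = rng.normal(size=(X.shape[1], n_feature))
    b_f = rng.normal(size=n_feature)
    W_h = rng.normal(size=(n_feature, n_enhance))
    b_h = rng.normal(size=n_enhance)

    Z = X @ W_f + b_f                # feature nodes Z
    H = np.tanh(Z @ W_h + b_h)       # enhancement nodes H
    A = np.hstack([Z, H])            # broad layer [Z | H]

    # Ridge-regularised least squares for the output weights only.
    W_out = np.linalg.solve(A.T @ A + ridge * np.eye(A.shape[1]), A.T @ Y)
    return W_f, b_f, W_h, b_h, W_out


def bls_predict(model, X):
    """Compute predictions Y = [Z | H] W_out for new inputs."""
    W_f, b_f, W_h, b_h, W_out = model
    Z = X @ W_f + b_f
    H = np.tanh(Z @ W_h + b_h)
    return np.hstack([Z, H]) @ W_out
```

Because only `W_out` is learned and it has a closed-form solution, the network is widened by adding nodes rather than deepened by adding layers. That widening is what the captions refer to as exploring the data distribution in a broad space.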
