Cooperative Sentiment Agents for Multimodal Sentiment Analysis
Shanmin Wang, Hui Shuai, Qingshan Liu, Fei Wang
TL;DR
Co-SA tackles Multimodal Sentiment Analysis (MSA) by learning adaptive cross-modal interactions through cooperative sentiment agents. In the Sentiment Agents Establishment (SAE) phase, it isolates sentiment-specific information within each modality using Modality-Sentiment Disentanglement (MSD) and captures temporal dynamics with Deep Phase Space Reconstruction (DPSR). In the Sentiment Agents Cooperation (SAC) phase, it jointly optimizes modality-specific policies, guided by a unified reward, to construct a robust joint representation. The approach yields consistent improvements over state-of-the-art baselines on MSA and Multimodal Emotion Recognition (MER) datasets and provides insight into the contributions of modality-shared and modality-unique features. The framework shows the value of treating multimodal fusion as a cooperative, policy-driven process rather than a fixed fusion scheme, with practical benefits for human-centered AI systems.
Abstract
In this paper, we propose a new Multimodal Representation Learning (MRL) method for Multimodal Sentiment Analysis (MSA), named Co-SA, which facilitates adaptive interaction between modalities through Cooperative Sentiment Agents. Co-SA comprises two critical components: the Sentiment Agents Establishment (SAE) phase and the Sentiment Agents Cooperation (SAC) phase. During the SAE phase, each sentiment agent handles a unimodal signal and highlights explicit dynamic sentiment variations within the modality via the Modality-Sentiment Disentanglement (MSD) and Deep Phase Space Reconstruction (DPSR) modules. Subsequently, in the SAC phase, Co-SA designs task-specific interaction mechanisms for the sentiment agents so that they coordinate multimodal signals to learn the joint representation. Specifically, Co-SA equips each sentiment agent with an independent policy model that captures significant properties within the modality. These policies are optimized jointly through a unified reward that adapts to the downstream task. Benefiting from this reward mechanism, Co-SA transcends the limitation of pre-defined fusion modes and adaptively captures unimodal properties for MRL in the multimodal interaction setting. To demonstrate the effectiveness of Co-SA, we apply it to MSA and Multimodal Emotion Recognition (MER) tasks. Our comprehensive experimental results demonstrate that Co-SA excels at discovering diverse cross-modal features, encompassing both common and complementary aspects. The code is available at https://github.com/smwanghhh/Co-SA.
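
To make the idea of independent per-modality policies optimized through one unified reward more concrete, the following is a minimal toy sketch and not the Co-SA implementation. It assumes two hypothetical agents ("text" and "audio"), each holding a softmax policy over three candidate properties, and a made-up reward that is 1 only when both agents pick a designated useful property. The names `shared_reward` and `train`, the action indices, and the exact (expectation-based) policy-gradient update are all illustrative assumptions; the paper's actual policy models, reward, and optimizer may differ.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def shared_reward(a_text, a_audio):
    # Hypothetical unified reward: 1 only when BOTH agents select the
    # (assumed) useful property of their modality, otherwise 0.
    return 1.0 if (a_text == 0 and a_audio == 1) else 0.0


def train(num_actions=3, steps=200, lr=1.0):
    """Exact policy gradient for two independent softmax policies
    that are both scored by the same shared reward."""
    theta_text = [0.0] * num_actions   # logits of the text agent's policy
    theta_audio = [0.0] * num_actions  # logits of the audio agent's policy
    acts = range(num_actions)
    for _ in range(steps):
        p_text, p_audio = softmax(theta_text), softmax(theta_audio)
        # Expected shared reward J under the current joint policy.
        expected = sum(p_text[i] * p_audio[j] * shared_reward(i, j)
                       for i in acts for j in acts)
        # Expected reward of each own action, averaged over the other agent.
        q_text = [sum(p_audio[j] * shared_reward(i, j) for j in acts)
                  for i in acts]
        q_audio = [sum(p_text[i] * shared_reward(i, j) for i in acts)
                   for j in acts]
        # Softmax policy gradient: dJ/dtheta_k = pi_k * (q_k - J).
        theta_text = [t + lr * p * (q - expected)
                      for t, p, q in zip(theta_text, p_text, q_text)]
        theta_audio = [t + lr * p * (q - expected)
                       for t, p, q in zip(theta_audio, p_audio, q_audio)]
    return softmax(theta_text), softmax(theta_audio)


if __name__ == "__main__":
    p_text, p_audio = train()
    print("text policy: ", [round(p, 3) for p in p_text])
    print("audio policy:", [round(p, 3) for p in p_audio])
```

In this toy setting neither agent is told which property matters; each policy shifts toward its useful action only because the shared reward rises when both do so together. That is the sense in which a unified reward can replace a hand-designed fusion rule, under the simplifying assumptions stated above.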
