MSE-Adapter: A Lightweight Plugin Endowing LLMs with the Capability to Perform Multimodal Sentiment Analysis and Emotion Recognition

Yang Yang, Xunde Dong, Yupeng Qiang

TL;DR

This work addresses two drawbacks of pre-trained language models (PLMs) in multimodal sentiment analysis and emotion recognition: high computational cost and the loss of general-purpose capability after task-specific training. It introduces MSE-Adapter, a lightweight plugin that freezes the backbone LLM and trains only a small adapter (approximately 2.6M to 2.8M parameters for 6–7B backbones). Central to the approach are the Text-Guide-Mixer (TGM), which aligns non-textual modalities with text via Hadamard-product interactions, and the Multi-Scale-Fusion (MSF) module, which performs early fusion of non-textual features before LLM processing. Experiments on MOSEI, SIMS-V2, MELD, and CHERMA show competitive or superior performance across English and Chinese datasets, and ablations indicate that TGM, MSF, and the non-textual modalities each contribute to the results. Because only the adapter is trained, the method runs on consumer-grade GPUs while preserving the LLM's general capabilities, and it provides a new baseline for integrating LLMs with multimodal sentiment tasks.
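
To put the quoted parameter counts in proportion, the short calculation below bounds the adapter-to-backbone ratio. It is illustrative arithmetic only: the inputs are the approximate figures stated above, the backbone sizes are taken as nominal 6B and 7B, and the function name is made up for this sketch. Since the text does not say which adapter size pairs with which backbone, the code reports the widest possible range.

```python
# Illustrative arithmetic only; inputs are the approximate figures quoted in the text.
ADAPTER_PARAMS = (2.6e6, 2.8e6)   # approximate trainable adapter parameters
BACKBONE_PARAMS = (6e9, 7e9)      # nominal 6B and 7B backbone sizes (assumed exact here)


def trainable_fraction_bounds(adapter, backbone):
    """Return the (lowest, highest) adapter-to-backbone parameter ratio."""
    low = min(adapter) / max(backbone)    # smallest adapter on the largest backbone
    high = max(adapter) / min(backbone)   # largest adapter on the smallest backbone
    return low, high


low, high = trainable_fraction_bounds(ADAPTER_PARAMS, BACKBONE_PARAMS)
print(f"{low:.3%} to {high:.3%}")  # 0.037% to 0.047%
```

Under these assumptions the trainable portion is on the order of a few hundredths of a percent of the backbone.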

Abstract

Current Multimodal Sentiment Analysis (MSA) and Emotion Recognition in Conversations (ERC) methods based on pre-trained language models exhibit two primary limitations: 1) once trained for MSA and ERC tasks, these pre-trained language models lose their original general-purpose capabilities; 2) they demand considerable computational resources. As pre-trained language models continue to grow, training larger multimodal sentiment analysis models with previous approaches could incur unnecessary computational cost. In response to this challenge, we propose the Multimodal Sentiment Analysis and Emotion Recognition Adapter (MSE-Adapter), a lightweight and adaptable plugin. This plugin enables a large language model (LLM) to carry out MSA or ERC tasks with minimal computational overhead (it introduces only approximately 2.6M to 2.8M trainable parameters on top of the 6B/7B models) while preserving the intrinsic capabilities of the LLM. In the MSE-Adapter, the Text-Guide-Mixer (TGM) module is introduced to establish explicit connections between non-textual and textual modalities through the Hadamard product. This allows non-textual modalities to better align with textual modalities at the feature level, promoting the generation of higher-quality pseudo tokens. Extensive experiments were conducted on four public English and Chinese datasets using consumer-grade GPUs and open-source LLMs (Qwen-1.8B, ChatGLM3-6B-base, and LLaMA2-7B) as the backbone. The results demonstrate the effectiveness of the proposed plugin. The code will be released on GitHub after blind review.
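
The Hadamard product mentioned above is simply element-wise multiplication of two vectors of equal length. The following minimal sketch shows that operation on toy values. It is not the authors' TGM implementation: the function names, the feature values, and the assumption that text and non-text features already share one dimension are all hypothetical, chosen only to illustrate how a text vector can scale a non-text vector dimension by dimension.

```python
def hadamard(a, b):
    """Element-wise (Hadamard) product of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return [x * y for x, y in zip(a, b)]


def text_guided_mix(text_feat, nontext_feat):
    """Hypothetical mixing step: scale each non-text dimension by the matching text dimension."""
    return hadamard(text_feat, nontext_feat)


# Toy features, assumed to be already projected to a shared dimension of 4.
text = [0.5, -1.0, 2.0, 0.0]
audio = [4.0, 3.0, 0.5, 7.0]

mixed = text_guided_mix(text, audio)
print(mixed)  # [2.0, -3.0, 1.0, 0.0]
```

In this toy case, a zero in the text vector suppresses the corresponding audio dimension entirely, which gives some intuition for how a multiplicative interaction lets text gate the non-textual signal.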

Paper Structure

This paper contains 34 sections, 6 equations, 4 figures, 10 tables.

Figures (4)

  • Figure 1: The comprehensive framework integrating MSE-Adapter with LLM.
  • Figure 2: The architecture of MSE-Adapter.
  • Figure 3: The Task-specific-prompt corresponding to different datasets.
  • Figure 4: Performance of MSE-Adapter with different amounts of training data.