Hierarchical Adaptive Expert for Multimodal Sentiment Analysis

Jiahao Qin, Feng Liu, Lu Zong

TL;DR

The paper tackles multimodal sentiment analysis, focusing on the difficulty of separating modality-shared from modality-specific information. It introduces HAEMSA, a hierarchical adaptive-expert framework that combines evolutionary architecture search, cross-modal knowledge transfer, and multi-task learning to learn multi-granularity representations across text, audio, and visual modalities. In experiments on CMU-MOSI, CMU-MOSEI, and IEMOCAP, HAEMSA improves on the previous best methods in 7-class accuracy, MAE, and weighted F1, and ablation studies indicate that each component contributes to the result. The work also discusses computational cost and robustness limitations, and outlines future work on efficiency, missing-modality handling, and cross-cultural generalization.

Abstract

Multimodal sentiment analysis has emerged as a critical tool for understanding human emotions across diverse communication channels. While existing methods have made significant strides, they often struggle to effectively differentiate and integrate modality-shared and modality-specific information, limiting the performance of multimodal learning. To address this challenge, we propose the Hierarchical Adaptive Expert for Multimodal Sentiment Analysis (HAEMSA), a novel framework that synergistically combines evolutionary optimization, cross-modal knowledge transfer, and multi-task learning. HAEMSA employs a hierarchical structure of adaptive experts to capture both global and local modality representations, enabling more nuanced sentiment analysis. Our approach leverages evolutionary algorithms to dynamically optimize network architectures and modality combinations, adapting to both partial and full modality scenarios. Extensive experiments demonstrate HAEMSA's superior performance across multiple benchmark datasets. On CMU-MOSEI, HAEMSA achieves a 2.6% increase in 7-class accuracy and a 0.059 decrease in MAE compared to the previous best method. For CMU-MOSI, we observe a 6.3% improvement in 7-class accuracy and a 0.058 reduction in MAE. On IEMOCAP, HAEMSA outperforms the state-of-the-art by 2.84% in weighted-F1 score for emotion recognition. These results underscore HAEMSA's effectiveness in capturing complex multimodal interactions and generalizing across different emotional contexts.
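The two regression-style metrics quoted above, MAE and 7-class accuracy, can be illustrated with a minimal sketch. It assumes the convention commonly used for CMU-MOSI and CMU-MOSEI, where sentiment is a real-valued score in [-3, 3] and the 7 classes are obtained by clipping and rounding to the nearest integer; the abstract itself does not spell out the evaluation protocol, and the function names here are hypothetical.

```python
def mae(preds, labels):
    """Mean absolute error between predicted and true sentiment scores."""
    return sum(abs(p - y) for p, y in zip(preds, labels)) / len(labels)


def to_class7(score):
    """Map a real-valued score to one of 7 integer classes in -3..3.

    Assumption: clip to [-3, 3], then round. Python's round() rounds
    exact halves to the nearest even integer.
    """
    clipped = max(-3.0, min(3.0, score))
    return int(round(clipped))


def acc7(preds, labels):
    """Fraction of samples whose predicted class matches the true class."""
    hits = sum(to_class7(p) == to_class7(y) for p, y in zip(preds, labels))
    return hits / len(labels)


# Illustrative values only, not results from the paper.
preds = [2.6, -0.4, 1.2, -3.7]
labels = [3.0, 0.0, 2.0, -3.0]
print(f"MAE  = {mae(preds, labels):.3f}")
print(f"Acc7 = {acc7(preds, labels):.2f}")
```

Under this convention a lower MAE and a higher 7-class accuracy are both better, which is why the abstract reports a decrease for the former and an increase for the latter.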

Paper Structure

This paper contains 21 sections, 14 equations, 3 figures, and 4 tables.

Figures (3)

  • Figure 1: Overview of the Hierarchical Adaptive Expert Network (HAEN) and the evolutionary optimization process. The left portion illustrates the hierarchical structure, in which modal-specific experts receive knowledge transferred from modal-shared experts to enable cross-modal knowledge fusion and multi-task optimization. The right portion depicts the key steps of the evolutionary algorithm (Selection, Recombination, and Mutation), which adapt network configurations over successive generations.
  • Figure 2: Overview of the Hierarchical Adaptive Expert Network (HAEN) for MSA. The network consists of five main components: (A) an evolutionary adaptive modal-shared expert with a hierarchical structure that learns granular representations, (B) modality-specific experts that handle single-modality information, (C) collaborative cross-modal integration, (D) an attention-based, task-driven feature selection module, and (E) an adaptive tower network that uses task-driven gates for multi-task learning to capture dependencies between sentiment analysis tasks.
  • Figure 3: t-SNE visualization of multimodal embeddings on CMU-MOSEI.
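
The Selection, Recombination, and Mutation steps named in the Figure 1 caption can be sketched as a generic evolutionary search over network configurations. This is a toy illustration, not the authors' implementation: the search space, the `toy_fitness` function (a stand-in for a validation score), and all other names and hyperparameters are assumptions made for the example.

```python
import random

# Hypothetical search space: expert count, hidden size, modality subset.
SEARCH_SPACE = {
    "num_experts": [2, 4, 8],
    "hidden_dim": [64, 128, 256],
    "modalities": [
        ("text",),
        ("text", "audio"),
        ("text", "visual"),
        ("text", "audio", "visual"),
    ],
}


def random_config(rng):
    """Sample one configuration uniformly from the search space."""
    return {key: rng.choice(options) for key, options in SEARCH_SPACE.items()}


def toy_fitness(cfg):
    """Placeholder score; a real search would train and validate a model."""
    return (len(cfg["modalities"])
            + cfg["num_experts"] / 8
            - abs(cfg["hidden_dim"] - 128) / 256)


def select(population, fitness, k):
    """Selection: keep the k highest-scoring configurations."""
    return sorted(population, key=fitness, reverse=True)[:k]


def recombine(a, b, rng):
    """Recombination: take each field from one of the two parents."""
    return {key: rng.choice([a[key], b[key]]) for key in a}


def mutate(cfg, rng, rate=0.2):
    """Mutation: resample each field with probability `rate`."""
    return {
        key: rng.choice(SEARCH_SPACE[key]) if rng.random() < rate else value
        for key, value in cfg.items()
    }


def evolve(fitness, generations=10, pop_size=12, n_parents=4, seed=0):
    rng = random.Random(seed)
    population = [random_config(rng) for _ in range(pop_size)]
    for _ in range(generations):
        parents = select(population, fitness, n_parents)
        children = []
        while len(children) < pop_size - n_parents:
            a, b = rng.sample(parents, 2)
            children.append(mutate(recombine(a, b, rng), rng))
        # Parents survive unchanged, so the best score never decreases.
        population = parents + children
    return max(population, key=fitness)


best = evolve(toy_fitness)
print(best)
```

Because modality subsets are part of each configuration, the same loop can in principle explore both partial and full modality settings, which is the role the abstract assigns to the evolutionary component.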