Multimodal Multi-loss Fusion Network for Sentiment Analysis

Zehui Wu, Ziwei Gong, Jaywon Koo, Julia Hirschberg

TL;DR

The paper tackles multimodal sentiment analysis by optimizing feature encoders and fusion strategies across audio and text. It introduces the MMML framework, featuring a separate Feature Network (RoBERTa for text; HuBERT/Data2Vec for audio) and a Transformer-based Fusion Network with cross-attention, augmented by multi-loss training, original signal restoration, and context modeling. Key findings show that pre-trained audio features and audio-text fusion yield state-of-the-art results across CMU-MOSI, CMU-MOSEI, and CH-SIMS, while context modeling and modality-specific multi-loss further boost performance, particularly when modalities have distinct labels. Collectively, these results provide a practical roadmap for feature selection and fusion in multimodal sentiment analysis, though the study is limited to two languages and omits vision features for efficiency and scope considerations.

Abstract

This paper investigates the optimal selection and fusion of feature encoders across multiple modalities and combines these in one neural network to improve sentiment detection. We compare different fusion methods and examine the impact of multi-loss training within the multi-modality fusion network, identifying surprisingly important findings relating to subnet performance. We have also found that integrating context significantly enhances model performance. Our best model achieves state-of-the-art performance for three datasets (CMU-MOSI, CMU-MOSEI and CH-SIMS). These results suggest a roadmap toward an optimized feature selection and fusion approach for enhancing sentiment detection in neural networks.
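The multi-loss idea mentioned above can be illustrated with a minimal sketch: the fused prediction and each single-modality subnet prediction get their own loss term, and the terms are summed. This excerpt does not give the paper's loss formula, so the use of mean squared error, the weights `w_text` and `w_audio`, and the function names here are assumptions made for illustration only.

```python
from typing import Optional, Sequence


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equal-length, non-empty sequences."""
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("pred and target must be non-empty and equal in length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def multi_loss(
    fused_pred: Sequence[float],
    text_pred: Sequence[float],
    audio_pred: Sequence[float],
    fused_target: Sequence[float],
    text_target: Optional[Sequence[float]] = None,
    audio_target: Optional[Sequence[float]] = None,
    w_text: float = 1.0,   # hypothetical weight, not taken from the paper
    w_audio: float = 1.0,  # hypothetical weight, not taken from the paper
) -> float:
    """Sum of the fusion loss and weighted per-modality subnet losses.

    When a dataset has no modality-specific labels, each subnet is
    scored against the shared (fused) label instead.
    """
    if text_target is None:
        text_target = fused_target
    if audio_target is None:
        audio_target = fused_target
    return (
        mse(fused_pred, fused_target)
        + w_text * mse(text_pred, text_target)
        + w_audio * mse(audio_pred, audio_target)
    )


if __name__ == "__main__":
    # Toy sentiment scores for two utterances (made-up numbers).
    loss = multi_loss(
        fused_pred=[0.5, -1.0],
        text_pred=[0.8, -0.5],
        audio_pred=[0.1, -1.2],
        fused_target=[1.0, -1.0],
    )
    print(f"total loss: {loss:.4f}")
```

Passing `text_target` or `audio_target` covers the case the summary highlights, where modalities carry distinct labels; leaving them unset reuses the shared label for every term.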

Paper Structure (30 sections, 3 equations, 2 figures, 8 tables)

Figures (2)

  • Figure 1: Our Model Structure
  • Figure 2: Model Variations