Tensor Fusion Network for Multimodal Sentiment Analysis

Amir Zadeh, Minghai Chen, Soujanya Poria, Erik Cambria, Louis-Philippe Morency

TL;DR

The paper tackles multimodal sentiment analysis by modeling intra-modality and inter-modality dynamics across language, visual, and acoustic signals. It introduces the Tensor Fusion Network (TFN), which consists of Modality Embedding Subnetworks, a Tensor Fusion Layer that explicitly captures unimodal, bimodal, and trimodal interactions, and a Sentiment Inference Subnetwork that supports several output tasks. Empirical results on the CMU-MOSI dataset demonstrate state-of-the-art performance for multimodal sentiment analysis and for each modality on its own, with ablations underscoring the value of trimodal dynamics. Qualitative analyses further show TFN's ability to leverage cross-modal cues to resolve sentiment that language alone cannot determine.

Abstract

Multimodal sentiment analysis is an increasingly popular research area, which extends the conventional language-based definition of sentiment analysis to a multimodal setup where other relevant modalities accompany language. In this paper, we pose the problem of multimodal sentiment analysis as modeling intra-modality and inter-modality dynamics. We introduce a novel model, termed Tensor Fusion Network, which learns both such dynamics end-to-end. The proposed approach is tailored for the volatile nature of spoken language in online videos as well as accompanying gestures and voice. In the experiments, our model outperforms state-of-the-art approaches for both multimodal and unimodal sentiment analysis.

Paper Structure

This paper contains 17 sections, 7 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Unimodal, bimodal and trimodal interaction in multimodal sentiment analysis.
  • Figure 2: Distribution of sentiment across different opinions (left) and opinion sizes (right) in CMU-MOSI.
  • Figure 3: Spoken Language Embedding Subnetwork ($\mathcal{U}_l$)
  • Figure 4: Left: Commonly used early fusion (multimodal concatenation). Right: Our proposed tensor fusion with three types of subtensors: unimodal, bimodal and trimodal.
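The contrast described in the Figure 4 caption can be illustrated with a small sketch. It assumes, as a detail not stated in this summary, that tensor fusion is a three-way outer product of the modality embeddings after a constant 1 is appended to each; under that assumption the resulting tensor contains unimodal, bimodal, and trimodal subtensors. The function names, the toy embedding values, and the pure-Python list representation are all illustrative choices, not the authors' implementation.

```python
def early_fusion(z_l, z_v, z_a):
    """Early fusion baseline: concatenate the three modality embeddings."""
    return list(z_l) + list(z_v) + list(z_a)


def tensor_fusion(z_l, z_v, z_a):
    """Three-way outer product of the embeddings, each extended with a 1.

    Assumption: appending the constant 1 is what makes the unimodal and
    bimodal terms survive inside the product alongside the trimodal ones.
    """
    l = list(z_l) + [1.0]
    v = list(z_v) + [1.0]
    a = list(z_a) + [1.0]
    # fused[i][j][k] = l[i] * v[j] * a[k]
    return [[[li * vj * ak for ak in a] for vj in v] for li in l]


# Toy embeddings (hypothetical values, chosen so products are easy to read).
z_l = [2.0, 3.0]   # language
z_v = [5.0]        # visual
z_a = [7.0, 11.0]  # acoustic

concat = early_fusion(z_l, z_v, z_a)
fused = tensor_fusion(z_l, z_v, z_a)

# The last index along each axis is the appended 1.
unimodal_language = [fused[i][-1][-1] for i in range(len(z_l))]
bimodal_language_visual = [[fused[i][j][-1] for j in range(len(z_v))]
                           for i in range(len(z_l))]
trimodal_example = fused[0][0][1]  # z_l[0] * z_v[0] * z_a[1]
```

With these toy sizes the concatenation has 2 + 1 + 2 = 5 entries, while the fused tensor has (2 + 1) × (1 + 1) × (2 + 1) = 18 entries. The multiplicative growth is the price of representing every cross-modal product explicitly.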