Tensor Fusion Network for Multimodal Sentiment Analysis
Amir Zadeh, Minghai Chen, Soujanya Poria, Erik Cambria, Louis-Philippe Morency
TL;DR
The paper frames multimodal sentiment analysis as modeling intra-modality and inter-modality dynamics across language, visual, and acoustic signals. It introduces the Tensor Fusion Network (TFN), which has three components: Modality Embedding Subnetworks, a Tensor Fusion Layer that explicitly captures unimodal, bimodal, and trimodal interactions, and a Sentiment Inference Subnetwork that can be trained for different sentiment prediction tasks. On the CMU-MOSI dataset, TFN achieves state-of-the-art results both for multimodal sentiment analysis and for each modality in isolation, and ablations underscore the value of trimodal dynamics. Qualitative analyses further show that TFN can use cross-modal cues to resolve sentiment that language alone cannot determine.
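To make the fusion step concrete, here is a minimal sketch of how a single tensor can hold unimodal, bimodal, and trimodal terms at once. It assumes the fusion is a three-way outer product of the modality embeddings, each extended with a constant 1. The function name `tensor_fusion`, the placement of the constant at index 0, and the toy embedding values are illustrative choices, not taken from the paper, and the embedding subnetworks and training are omitted.

```python
def tensor_fusion(z_language, z_visual, z_acoustic):
    """Three-way outer product of the embeddings, each extended with a constant 1.

    Illustrative sketch only: the constant is placed at index 0 here
    (an assumption made for readability of the indices below).
    """
    zl = [1.0] + list(z_language)
    zv = [1.0] + list(z_visual)
    za = [1.0] + list(z_acoustic)
    # fused[i][j][k] = zl[i] * zv[j] * za[k]
    # Wherever an index is 0, that modality contributes only the constant 1,
    # so the entry reduces to a unimodal or bimodal term.
    return [[[a * b * c for c in za] for b in zv] for a in zl]


# Toy embeddings with made-up values (hypothetical, not from the paper).
fused = tensor_fusion([2.0, 3.0], [5.0], [7.0, 11.0])

unimodal_language = fused[1][0][0]   # 2.0 * 1 * 1
bimodal_lang_vis = fused[1][1][0]    # 2.0 * 5.0 * 1
trimodal = fused[2][1][2]            # 3.0 * 5.0 * 11.0
```

With embeddings of sizes 2, 1, and 2, the fused tensor has shape 3 x 2 x 3. Its size grows with the product of the embedding dimensions, which is worth keeping in mind when choosing embedding sizes.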
Abstract
Multimodal sentiment analysis is an increasingly popular research area, which extends the conventional language-based definition of sentiment analysis to a multimodal setup where other relevant modalities accompany language. In this paper, we pose the problem of multimodal sentiment analysis as modeling intra-modality and inter-modality dynamics. We introduce a novel model, termed Tensor Fusion Network, which learns both such dynamics end-to-end. The proposed approach is tailored for the volatile nature of spoken language in online videos as well as accompanying gestures and voice. In the experiments, our model outperforms state-of-the-art approaches for both multimodal and unimodal sentiment analysis.
