Select-Additive Learning: Improving Generalization in Multimodal Sentiment Analysis

Haohan Wang, Aaksha Meghawat, Louis-Philippe Morency, Eric P. Xing

TL;DR

This work tackles generalization in multimodal sentiment analysis under data scarcity by introducing Select-Additive Learning (SAL), a two-phase method that first identifies identity-related confounding dimensions in latent representations and then suppresses their influence by adding Gaussian noise during training. SAL extends a CNN-based classifier with an auxiliary predictor and a Gaussian Sampling Layer. It first optimizes a selection loss with a sparsity term to locate confounds, then an addition loss that encourages reliance on non-confounding features. On the MOSI, YouTube, and MOUD datasets, SAL improves prediction accuracy in the verbal, acoustic, and visual modalities and in their fusion, with statistically significant gains in cross-dataset tests, demonstrating enhanced robustness to speaker identity and dataset shifts.
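The noise-injection idea in the second phase can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `gaussian_sampling_layer`, the per-dimension `confound_weights`, and the `sigma` parameter are assumptions made for illustration, and the summary above does not specify how the selection phase produces its output. The sketch only shows that dimensions flagged as confounded receive Gaussian noise while unflagged dimensions pass through unchanged.

```python
import random


def gaussian_sampling_layer(latent, confound_weights, sigma=1.0, rng=None):
    """Perturb a latent vector with zero-mean Gaussian noise.

    Hypothetical helper: each dimension's noise is scaled by its
    confound weight, so a weight of 0.0 leaves that dimension untouched.
    """
    if len(latent) != len(confound_weights):
        raise ValueError("latent and confound_weights must have the same length")
    if rng is None:
        rng = random.Random()
    return [
        h + w * rng.gauss(0.0, sigma)
        for h, w in zip(latent, confound_weights)
    ]


# Assumed example values: dimensions 1 and 3 were flagged as confounded.
latent = [0.8, -1.2, 0.3, 2.0]
confound_weights = [0.0, 1.0, 0.0, 0.5]
noisy = gaussian_sampling_layer(
    latent, confound_weights, sigma=0.5, rng=random.Random(0)
)
```

Training a classifier on `noisy` rather than `latent` makes the flagged dimensions unreliable as predictors, which is the intuition behind encouraging reliance on non-confounding features.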

Abstract

Multimodal sentiment analysis is drawing an increasing amount of attention. It enables mining of opinions in video reviews, which are now abundant on online platforms. However, multimodal sentiment analysis has only a few high-quality datasets annotated for training machine learning algorithms. These limited resources restrict the generalizability of models: for example, the unique characteristics of a few speakers (e.g., wearing glasses) may become a confounding factor for the sentiment classification task. In this paper, we propose a Select-Additive Learning (SAL) procedure that improves the generalizability of trained neural networks for multimodal sentiment analysis. In our experiments, we show that our SAL approach improves prediction accuracy significantly in all three modalities (verbal, acoustic, visual), as well as in their fusion. Our results show that SAL, even when trained on one dataset, achieves good generalization across two new test datasets.

Paper Structure

This paper contains 17 sections, 3 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: An illustrative dataset demonstrating "wearing glasses" as a confounding factor. Due to the limited amount of data, the model learns that wearing glasses means negative sentiment, a rule that holds only for this training dataset. (Orange denotes negative sentiment; green denotes positive sentiment; blue denotes correct rules and red denotes incorrect rules.)
  • Figure 2: The SAL architecture is achieved by a simple extension of a general deep learning discriminative classifier. The purple part is the original deep learning model. The red part is the extension SAL introduces. The extension network is connected to the original network via a Gaussian Sampling Layer.
  • Figure 3: Illustration of SAL. On the left, the network structure and training objective are presented. On the right, circles denote neurons and squares denote dimensions of the representation.
  • Figure 4: Confounding factors identified in the Selection phase for the first 50 utterances (rows) and the first 100 representation values (columns) in the training set.