Enriching Multimodal Sentiment Analysis through Textual Emotional Descriptions of Visual-Audio Content

Sheng Wu, Xiaobao Wang, Longbiao Wang, Dongxiao He, Jianwu Dang

TL;DR

This work tackles the challenge of fine-grained emotion discrimination in multimodal sentiment analysis by enriching audio-visual content with textual emotional descriptions. The proposed DEVA framework introduces an Emotional Description Generator to convert audio and facial cues into descriptive text, and a Text-Guided Progressive Fusion (TPF) module that uses text as the core modality to progressively fuse minor modalities. Through unimodal coding, description augmentation, and cross-modal fusion, DEVA achieves state-of-the-art results on MOSI, MOSEI, and CH-SIMS, with ablations confirming the importance of EDG, MFU, CEU, and the fusion strategy. The approach demonstrates robust sensitivity to subtle emotional variations and offers a new direction for leveraging textual representations to bridge multimodal gaps in sentiment analysis.

Abstract

Multimodal Sentiment Analysis (MSA) stands as a critical research frontier, seeking to comprehensively unravel human emotions by amalgamating text, audio, and visual data. Yet, discerning subtle emotional nuances within audio and video expressions poses a formidable challenge, particularly when emotional polarities across various segments appear similar. In this paper, our objective is to spotlight emotion-relevant attributes of audio and visual modalities to facilitate multimodal fusion in the context of nuanced emotional shifts in visual-audio scenarios. To this end, we introduce DEVA, a progressive fusion framework founded on textual sentiment descriptions aimed at accentuating emotional features of visual-audio content. DEVA employs an Emotional Description Generator (EDG) to transmute raw audio and visual data into textualized sentiment descriptions, thereby amplifying their emotional characteristics. These descriptions are then integrated with the source data to yield richer, enhanced features. Furthermore, DEVA incorporates the Text-guided Progressive Fusion Module (TPF), leveraging varying levels of text as a core modality guide. This module progressively fuses visual-audio minor modalities to alleviate disparities between text and visual-audio modalities. Experimental results on widely used sentiment analysis benchmark datasets, including MOSI, MOSEI, and CH-SIMS, underscore significant enhancements compared to state-of-the-art models. Moreover, fine-grained emotion experiments corroborate the robust sensitivity of DEVA to subtle emotional variations.
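
The abstract describes text-guided progressive fusion only at a high level, so the following is a minimal illustrative sketch rather than the authors' TPF module. It assumes that "text as a core modality guide" can be approximated by plain scaled dot-product cross-attention in which text tokens act as queries over audio and visual features, with a residual update repeated for a number of layers. The function names, the `depth` parameter, the absence of learned projections, and the toy feature values are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values):
    """Scaled dot-product attention: each query attends over keys_values.

    Hypothetical simplification: keys and values are the same vectors and
    no learned projection matrices are used.
    """
    dim = len(queries[0])
    attended = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, kv)) / math.sqrt(dim)
            for kv in keys_values
        ]
        weights = softmax(scores)
        attended.append(
            [sum(w * kv[d] for w, kv in zip(weights, keys_values)) for d in range(dim)]
        )
    return attended


def text_guided_fusion(text, audio, visual, depth=2):
    """Fold the minor modalities into the text representation, layer by layer.

    Assumption: each layer adds the attended audio context, then the attended
    visual context, to the running text-centred representation.
    """
    fused = [row[:] for row in text]
    for _ in range(depth):
        for minor in (audio, visual):
            context = cross_attend(fused, minor)
            fused = [
                [f + c for f, c in zip(f_row, c_row)]
                for f_row, c_row in zip(fused, context)
            ]
    return fused


# Toy features (made up): 2 text tokens, 3 audio frames, 2 visual frames, dim 2.
text = [[1.0, 0.0], [0.0, 1.0]]
audio = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]
visual = [[2.0, 0.0], [0.0, 0.5]]
fused = text_guided_fusion(text, audio, visual, depth=2)
print(len(fused), len(fused[0]))
```

In this sketch the output keeps the shape of the text sequence, which mirrors the idea that text remains the core modality while audio and visual cues are absorbed into it step by step.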

Paper Structure

This paper contains 35 sections, 13 equations, 7 figures, 12 tables.

Figures (7)

  • Figure 1: Illustration of our motivation. The text transcription of the audio is shown in the blue box, and the visual emotional description is shown in the orange box. Teal highlighting marks highly emotion-relevant descriptions, and teal circles mark the corresponding micro-expressions in the source data.
  • Figure 2: The overall architecture of DEVA consists of unimodal coding, an emotional description generator, feature enhancement, text-guided progressive fusion, and multimodal fusion.
  • Figure 3: The architecture of TPF.
  • Figure 4: The impact of the number of Action Units on performance.
  • Figure 5: The impact of TPF depth on performance.
  • ...and 2 more figures