LLM-Driven Multimodal Opinion Expression Identification

Bonian Jia, Huiyao Chen, Yueheng Sun, Meishan Zhang, Min Zhang

TL;DR

This study introduces a novel multimodal OEI (MOEI) task, integrating text and speech to mirror real-world scenarios, and proposes an LLM-driven method, STOEI, which combines the speech and text modalities to identify opinion expressions.

Abstract

Opinion Expression Identification (OEI) is essential in NLP for applications ranging from voice assistants to depression diagnosis. This study extends OEI to encompass multimodal inputs, underlining the significance of auditory cues in delivering emotional subtleties beyond the capabilities of text. We introduce a novel multimodal OEI (MOEI) task, integrating text and speech to mirror real-world scenarios. Utilizing the CMU MOSEI and IEMOCAP datasets, we construct the CI-MOEI dataset. Additionally, Text-to-Speech (TTS) technology is applied to the MPQA dataset to obtain the CIM-OEI dataset. We design a template for the OEI task to take full advantage of the generative power of large language models (LLMs). Advancing further, we propose an LLM-driven method, STOEI, which combines the speech and text modalities to identify opinion expressions. Our experiments demonstrate that MOEI significantly improves performance, while our method outperforms existing methods by 9.20% and achieves SOTA results.
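The abstract mentions designing a template that frames OEI as a generation task for an LLM. As a minimal sketch of what such a template could look like, assuming a hypothetical instruction wording and a pipe-separated `expression | polarity` output format (the paper's actual template may differ):

```python
def build_oei_prompt(sentence: str) -> str:
    """Wrap a sentence in an instruction template for generative OEI.

    The wording and output format here are illustrative assumptions,
    not the paper's actual template.
    """
    return (
        "Identify every opinion expression in the sentence below and "
        "label its polarity (positive or negative).\n"
        f"Sentence: {sentence}\n"
        "Answer (one `expression | polarity` pair per line):"
    )


def parse_oei_output(generated: str) -> list[tuple[str, str]]:
    """Parse generated lines back into (expression, polarity) pairs."""
    pairs = []
    for line in generated.strip().splitlines():
        if "|" in line:
            expression, polarity = (part.strip() for part in line.split("|", 1))
            pairs.append((expression, polarity))
    return pairs


if __name__ == "__main__":
    print(build_oei_prompt("The service was slow, but the staff were lovely."))
    # Stand-in for an LLM's generated answer:
    fake_output = "was slow | negative\nwere lovely | positive"
    print(parse_oei_output(fake_output))
```

Casting span extraction as constrained text generation like this is one common way to exploit an LLM's generative ability for tagging-style tasks while keeping the output machine-parsable.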


Paper Structure

This paper contains 14 sections, 3 equations, 3 figures, and 4 tables.

Figures (3)

  • Figure 1: Opinion expressions in the same sentence can show different emotional polarities in different speech scenarios.
  • Figure 2: The overall architecture of our method STOEI, where $\oplus$ indicates vectorial concatenation (see the fusion sketch after this list).
  • Figure 3: $\text{F}_1$ scores of Whispering-LLAMA, Vicuna (Text-only), and STOEI for different sentence lengths. All methods in the figure were trained directly on CIM-MOEI, except Vicuna (Text-only), which was trained on CIM-OEI.
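
The Figure 2 caption states that $\oplus$ denotes vectorial concatenation of the speech and text representations. Below is a minimal PyTorch sketch of this kind of fusion, assuming hypothetical encoder dimensions, linear projections, and sequence-axis concatenation; the actual STOEI wiring may differ:

```python
import torch
import torch.nn as nn


class SpeechTextFusion(nn.Module):
    """Concatenate speech and text features after projecting them into a
    shared hidden space, illustrating the $\oplus$ operation in Figure 2.
    All dimensions are illustrative, not the paper's configuration."""

    def __init__(self, speech_dim: int = 768, text_dim: int = 4096,
                 hidden_dim: int = 4096):
        super().__init__()
        # Map both modalities into the same hidden space before fusing.
        self.speech_proj = nn.Linear(speech_dim, hidden_dim)
        self.text_proj = nn.Linear(text_dim, hidden_dim)

    def forward(self, speech_feats: torch.Tensor,
                text_feats: torch.Tensor) -> torch.Tensor:
        # speech_feats: (batch, speech_len, speech_dim), e.g. speech-encoder output
        # text_feats:   (batch, text_len, text_dim),     e.g. LLM token embeddings
        speech = self.speech_proj(speech_feats)  # (batch, speech_len, hidden_dim)
        text = self.text_proj(text_feats)        # (batch, text_len, hidden_dim)
        # Concatenate along the sequence axis so the LLM attends over both.
        return torch.cat([speech, text], dim=1)  # (batch, speech_len + text_len, hidden_dim)


if __name__ == "__main__":
    fusion = SpeechTextFusion()
    fused = fusion(torch.randn(2, 50, 768), torch.randn(2, 20, 4096))
    print(fused.shape)  # torch.Size([2, 70, 4096])
```

Concatenating along the sequence axis lets the language model attend jointly over acoustic and textual tokens, which is one standard way to realize the concatenation shown in such architecture diagrams.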