LLM supervised Pre-training for Multimodal Emotion Recognition in Conversations

Soumya Dutta; Sriram Ganapathy

TL;DR

This paper pre-trains a text-based emotion recognition model on unlabeled speech transcripts, using pseudo-labels generated by an LLM as supervision, and trains the combined speech-text model hierarchically to reflect the conversational nature of the data.

Abstract

Emotion recognition in conversations (ERC) is challenging due to the multimodal nature of emotion expression. In this paper, we propose to pre-train a text-based recognition model on unlabeled speech transcripts with LLM guidance. The transcripts are obtained from a raw speech dataset using a pre-trained ASR system. A text LLM is queried to provide pseudo-labels for these transcripts, and the pseudo-labeled transcripts are then used to train an utterance-level text-based emotion recognition model. We use the utterance-level text embeddings for emotion recognition in conversations, together with speech embeddings obtained from a recently proposed pre-trained model. A hierarchical way of training the speech-text model is proposed, keeping in mind the conversational nature of the dataset. We perform experiments on three established datasets, namely IEMOCAP, MELD, and CMU-MOSI, where we show that the proposed model improves over existing baselines and achieves state-of-the-art results on two of the three datasets.
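The pseudo-labeling step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the prompt wording, the function names (`parse_label`, `build_silver_dataset`, `stub_llm`), and the rule of dropping transcripts whose LLM response cannot be parsed are all assumptions. The three-way label set is taken from the Figure 1 caption. The `query_llm` argument stands in for whatever LLM client is used; a keyword-based stub is supplied here so the sketch runs without a model.

```python
from typing import Callable, Iterable, List, Optional, Tuple

# Label set as stated in the Figure 1 caption.
LABELS = ("positive", "negative", "neutral")

# Hypothetical prompt wording (an assumption, not taken from the paper).
PROMPT = (
    "Classify the sentiment of this utterance as positive, negative or neutral.\n"
    "Utterance: {text}\nAnswer with one word."
)


def parse_label(response: str) -> Optional[str]:
    """Return the first known label mentioned in the response, else None."""
    for token in response.lower().split():
        token = token.strip(".,!?:;\"'")
        if token in LABELS:
            return token
    return None


def build_silver_dataset(
    transcripts: Iterable[str], query_llm: Callable[[str], str]
) -> List[Tuple[str, str]]:
    """Pair each ASR transcript with an LLM pseudo-label.

    Empty transcripts and unparseable responses are skipped
    (an assumed filtering rule).
    """
    pairs = []
    for text in transcripts:
        text = text.strip()
        if not text:
            continue
        label = parse_label(query_llm(PROMPT.format(text=text)))
        if label is not None:
            pairs.append((text, label))
    return pairs


def stub_llm(prompt: str) -> str:
    """Keyword-based stand-in for a real LLM call, for demonstration only."""
    utterance = prompt.split("Utterance: ", 1)[1].lower()
    if "great" in utterance:
        return "Positive."
    if "terrible" in utterance:
        return "The sentiment is negative"
    if "hmm" in utterance:
        return "unsure"
    return "Neutral"


if __name__ == "__main__":
    asr_output = ["That was great!", "This is terrible.", "See you at five."]
    for text, label in build_silver_dataset(asr_output, stub_llm):
        print(f"{label}\t{text}")
```

The resulting (transcript, label) pairs would then serve as the supervised training set for the text classifier. Passing the LLM as a callable keeps the sketch independent of any particular model or API.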

Paper Structure (18 sections, 4 figures, 2 tables)

Introduction
Related Work
Method
Background
CARE
RoBERTa
Proposed MERITS-L model
Problem Description
LLM guided text pre-training
Training
Experiments and Results
Datasets
Implementation details
Results
Evaluation with different LLMs
...and 3 more sections

Figures (4)

Figure 1: Block diagram of the proposed model. The pre-training stage is shown in the grey box at the top. An ASR system generates transcripts for the pre-training data, and a large language model (LLM) annotates each transcript with positive, negative, or neutral sentiment. These "silver" labels, together with the transcripts, form the supervised training dataset for the RoBERTa-large model. A frozen CARE model [dutta2024leveraging] is used to extract audio embeddings. Both the text and speech embeddings thus use only unsupervised data. The MERITS-L model is trained in three stages (denoted Stage I, II, and III in the diagram), where the models trained in one stage are kept frozen in subsequent stages.
Figure 2: The co-attention network used in the proposed model. It consists of two sub-blocks: the cross-attention block and the self-attention block.
Figure 3: The performance of the RoBERTa-large models on the different datasets. Different LLMs are used for generating pseudo emotion labels from speech transcripts. The performance of pre-trained RoBERTa without any supervised fine-tuning is also reported.
Figure 4: The importance of hierarchical training in MERITS-L.
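To make the Figure 2 caption concrete, the sketch below chains a cross-attention step and a self-attention step using plain scaled dot-product attention. It is only an illustration under several assumptions that the caption does not confirm: there are no learned projections, heads, or normalization layers; a residual sum joins the two sub-blocks; both sequences share the same embedding dimension; and the choice of which modality supplies the queries is arbitrary. The names `attention` and `co_attention` are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]  # one row per utterance embedding


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries: Matrix, keys: Matrix, values: Matrix) -> Matrix:
    """Scaled dot-product attention without learned projections."""
    scale = math.sqrt(len(keys[0]))
    output = []
    for q in queries:
        weights = softmax(
            [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        )
        output.append(
            [
                sum(w * v[c] for w, v in zip(weights, values))
                for c in range(len(values[0]))
            ]
        )
    return output


def co_attention(query_seq: Matrix, context_seq: Matrix) -> Matrix:
    """Cross-attention over the other modality, then self-attention.

    Assumes both sequences have the same embedding dimension; the
    residual sum between the sub-blocks is an assumption.
    """
    # Cross-attention: each query row attends over the other modality.
    attended = attention(query_seq, context_seq, context_seq)
    # Residual connection (assumed) before the self-attention sub-block.
    fused = [
        [q + a for q, a in zip(q_row, a_row)]
        for q_row, a_row in zip(query_seq, attended)
    ]
    # Self-attention: the fused sequence attends over itself.
    return attention(fused, fused, fused)


if __name__ == "__main__":
    text = [[1.0, 0.0], [0.0, 1.0]]
    speech = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]
    print(co_attention(text, speech))
```

The output has one row per query utterance, so the two modalities may have different sequence lengths as long as their embedding sizes match.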
