Context and System Fusion in Post-ASR Emotion Recognition with Large Language Models

Pavel Stepachev; Pinzhen Chen; Barry Haddow

TL;DR

Conversation context yields diminishing returns for LLM-based post-ASR emotion prediction, and the metric used to select the transcript for prediction is crucial. The best submission surpasses the provided baseline by 20% in absolute accuracy.

Abstract

Large language models (LLMs) have started to play a vital role in modelling speech and text. To explore the best use of context and multiple systems' outputs for post-ASR speech emotion prediction, we study LLM prompting on a recent task named GenSEC. Our techniques include ASR transcript ranking, variable conversation context, and system output fusion. We show that the conversation context has diminishing returns and the metric used to select the transcript for prediction is crucial. Finally, our best submission surpasses the provided baseline by 20% in absolute accuracy.

Paper Structure (18 sections, 1 equation, 4 figures, 5 tables)

INTRODUCTION
DATA AND EVALUATION
CONTEXT SELECTION METHODOLOGY
Ranking
Naive selection heuristics
The use of conversation context
ASR output fusion
EXPERIMENTS AND RESULTS
Setup
Results
Comparison between GPT models
Effects of the conversation context
ASR output ranking
ASR-context fusion
Final results
...and 3 more sections

Figures (4)

Figure 1: Prompt template with context size 2 with the last utterance needing emotion prediction.
Figure 2: Prompt template with a context size 4 as well as 5 ASR outputs as a means of fusion.
Figure 3: Performance of ranking metrics with various context sizes on gpt-4o.
Figure 4: Performance of naive heuristics metrics with various context sizes on gpt-4o.
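
The captions for Figures 1 and 2 describe prompts built from a variable number of preceding utterances and, optionally, several ASR outputs for the utterance to be classified. A minimal sketch of how such a prompt could be assembled is given below. It is not the authors' template: the function name `build_prompt`, the prompt wording, the label set passed in by the caller, and the assumption that the context size counts only utterances before the target are all hypothetical choices made for illustration.

```python
from typing import Optional, Sequence


def build_prompt(
    conversation: Sequence[str],
    context_size: int,
    labels: Sequence[str],
    asr_hypotheses: Optional[Sequence[str]] = None,
) -> str:
    """Assemble an emotion-prediction prompt for the last utterance.

    Assumptions (not taken from the paper): `context_size` counts only the
    utterances that precede the target, and the wording below is illustrative.
    """
    if not conversation:
        raise ValueError("conversation must contain at least one utterance")
    if context_size < 0:
        raise ValueError("context_size must be non-negative")

    *history, target = conversation
    # Keep only the most recent `context_size` utterances before the target.
    context = history[-context_size:] if context_size > 0 else []

    lines = ["Conversation context:"]
    if context:
        lines += [f"{i}. {utt}" for i, utt in enumerate(context, start=1)]
    else:
        lines.append("(none)")

    if asr_hypotheses:
        # Fusion: show every ASR output for the target instead of a single one.
        lines.append("ASR outputs for the utterance to classify:")
        lines += [f"- {hyp}" for hyp in asr_hypotheses]
    else:
        lines.append(f"Utterance to classify: {target}")

    lines.append("Answer with exactly one label from: " + ", ".join(labels))
    return "\n".join(lines)


if __name__ == "__main__":
    demo = ["utt-one", "utt-two", "utt-three", "utt-four"]
    # Placeholder labels; the task's actual label set is not given here.
    print(build_prompt(demo, context_size=2, labels=["label-a", "label-b"]))
```

Passing `asr_hypotheses` switches the sketch from the single-transcript setting to the fusion setting, where the model sees all candidate transcripts for the target utterance at once. Keeping `context_size` as a plain parameter makes it easy to sweep context lengths when checking for diminishing returns.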
