Table of Contents
Fetching ...

Prompt Amplification and Zero-Shot Late Fusion in Audio-Language Models for Speech Emotion Recognition

Saurabh Kataria, Xiao Hu

Abstract

Audio-Language Models (ALMs) are making strides in understanding speech and non-speech audio. However, domain-specialist Foundation Models (FMs) remain the best for closed-ended speech processing tasks such as Speech Emotion Recognition (SER). Using ALMs for Zero-shot SER is a popular choice, but their potential to work with specialists to achieve state-of-the-art (SOTA) performance remains unexplored. We propose ZS-Fuse, a late-fusion method that combines zero-shot emotion estimates from a dual-encoder ALM with specialist FMs. To handle ambiguity in emotions and sensitivity to prompt choice, 1) we use a simple prompt ensemble and 2) suggest a novel technique called prompt amplification, which repeats audio and text queries to discover stronger zero-shot capabilities. We demonstrate the efficacy of our technique by evaluating ZS-Fuse with three dual-encoder ALMs and two FMs, and report improvements over SOTA baselines, such as WavLM-Large, on three speech emotion recognition datasets.

Prompt Amplification and Zero-Shot Late Fusion in Audio-Language Models for Speech Emotion Recognition

Abstract

Audio-Language Models (ALMs) are making strides in understanding speech and non-speech audio. However, domain-specialist Foundation Models (FMs) remain the best for closed-ended speech processing tasks such as Speech Emotion Recognition (SER). Using ALMs for Zero-shot SER is a popular choice, but their potential to work with specialists to achieve state-of-the-art (SOTA) performance remains unexplored. We propose ZS-Fuse, a late-fusion method that combines zero-shot emotion estimates from a dual-encoder ALM with specialist FMs. To handle ambiguity in emotions and sensitivity to prompt choice, 1) we use a simple prompt ensemble and 2) suggest a novel technique called prompt amplification, which repeats audio and text queries to discover stronger zero-shot capabilities. We demonstrate the efficacy of our technique by evaluating ZS-Fuse with three dual-encoder ALMs and two FMs, and report improvements over SOTA baselines, such as WavLM-Large, on three speech emotion recognition datasets.
Paper Structure (14 sections, 2 figures, 2 tables)

This paper contains 14 sections, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Proposed Zero-shot Fusion (ZS-Fuse) method. The speech foundation fine-tuning branch is complemented by zero-shot estimates from an audio-text dual encoder. The similarity score ($P\times E$) of pre-determined prompt templates ($P$) is computed for all emotion classes ($E$). Additionally, prompt amplification trick can be used to increase the audio length by repeating it and prompt strength by replacing "<emotion_j>" with "N instances of <emotion_j>", for further potential gains.
  • Figure 2: Zero-shot UAR of three ALMs on three datasets for 16 combinations of audio and text repeats ($[1,2,3,4]\times[1,2,3,4]$).