OmniVox: Zero-Shot Emotion Recognition with Omni-LLMs

John Murzaku, Owen Rambow

TL;DR

OmniVox investigates zero-shot emotion recognition using omni-LLMs that accept multiple modalities, focusing on predicting emotion labels from audio. The approach prompts four omni-LLMs with acoustic-aware instructions, including context and step-by-step reasoning, and analyzes how context windows and acoustic descriptions affect emotion recognition in conversation (ERC) performance on IEMOCAP and MELD. Key findings show that acoustic prompting improves performance across models and that context helps on IEMOCAP, with some models approaching or surpassing fine-tuned baselines, though results vary by corpus and modality. The paper also provides a detailed error analysis linking misclassifications to mismatches in acoustic feature descriptions, highlighting both the promise and limitations of zero-shot omni-LLMs for practical ERC tasks.

Abstract

The use of omni-LLMs (large language models that accept any modality as input), particularly for multimodal cognitive state tasks involving speech, is understudied. We present OmniVox, the first systematic evaluation of four omni-LLMs on the zero-shot emotion recognition task. We evaluate on two widely used multimodal emotion benchmarks, IEMOCAP and MELD, and find that zero-shot omni-LLMs outperform or are competitive with fine-tuned audio models. Alongside our audio-only evaluation, we also evaluate omni-LLMs on text-only and combined text-and-audio inputs. We present acoustic prompting, an audio-specific prompting strategy for omni-LLMs that focuses on acoustic feature analysis, conversation context analysis, and step-by-step reasoning. We compare our acoustic prompting to minimal prompting and full chain-of-thought prompting techniques. We perform a context window analysis on IEMOCAP and MELD, and find that using context helps, especially on IEMOCAP. We conclude with an error analysis on the generated acoustic reasoning outputs from the omni-LLMs.
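
As a rough illustration of how acoustic prompting with a context window might be assembled, the sketch below builds a text instruction that accompanies the target audio clip. Everything here is an assumption for illustration: the `Utterance` type, the function names, the prompt wording, the use of prior-turn transcripts as context, and the label set passed in are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Utterance:
    """One conversational turn (hypothetical structure)."""
    speaker: str
    transcript: str


def context_window(dialogue: Sequence[Utterance], target_index: int, window: int) -> List[Utterance]:
    """Return up to `window` turns that immediately precede the target turn."""
    start = max(0, target_index - window)
    return list(dialogue[start:target_index])


def build_acoustic_prompt(
    dialogue: Sequence[Utterance],
    target_index: int,
    window: int,
    labels: Sequence[str],
) -> str:
    """Assemble a text instruction to send alongside the target audio.

    The three numbered steps mirror the components named in the abstract
    (context analysis, acoustic feature analysis, step-by-step reasoning);
    the exact wording is an assumption, not the authors' prompt.
    """
    lines = []
    context = context_window(dialogue, target_index, window)
    if context:
        lines.append("Conversation context (preceding turns):")
        lines.extend(f"{u.speaker}: {u.transcript}" for u in context)
        lines.append("")
    lines.append("Listen to the attached audio of the final utterance.")
    lines.append("1. Analyze the conversation context, if any is given.")
    lines.append("2. Describe the acoustic features you hear (e.g., pitch, loudness, speaking rate).")
    lines.append("3. Reason step by step about the speaker's emotion.")
    lines.append("Finish with one line of the form 'Emotion: <label>'.")
    lines.append("Allowed labels: " + ", ".join(labels))
    return "\n".join(lines)
```

Setting `window` to 0 yields a context-free prompt, so the same helper could be reused to compare different context sizes in the style of the context window analysis.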

Paper Structure

This paper contains 25 sections, 1 figure, and 6 tables.

Figures (1)

  • Figure 1: The proposed OmniVox framework. We perform zero-shot emotion recognition from audio inputs paired with text instructions and, optionally, contextual information or transcripts. The model then generates a context analysis, an acoustic feature interpretation, and a final chain-of-thought reasoning, ultimately predicting a specific emotion label (sad in this example).
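
The last stage in Figure 1, going from generated reasoning to a single emotion label, could be handled with a small parser like the sketch below. It assumes, hypothetically, that the model was instructed to end its response with a line of the form `Emotion: <label>`; that output format, the function name `extract_label`, and the label set in the usage example are illustrative assumptions rather than details from the paper.

```python
import re
from typing import Optional, Sequence


def extract_label(response: str, labels: Sequence[str]) -> Optional[str]:
    """Return the label from the last 'Emotion: <label>' line, or None.

    The 'Emotion:' line format is an assumed convention. Responses that
    contain no such line, or a label outside `labels`, yield None so the
    caller can count them as unparseable instead of guessing.
    """
    matches = re.findall(r"(?im)^\s*emotion\s*:\s*([a-z]+)", response)
    if not matches:
        return None
    candidate = matches[-1].lower()
    allowed = {label.lower() for label in labels}
    return candidate if candidate in allowed else None


# Usage with a made-up model response.
example_response = (
    "Context analysis: the speaker has just received bad news.\n"
    "Acoustic features: low pitch, slow speaking rate, quiet voice.\n"
    "Reasoning: the context and the acoustics point the same way.\n"
    "Emotion: Sad\n"
)
predicted = extract_label(example_response, ["angry", "happy", "neutral", "sad"])
```

Returning `None` for out-of-vocabulary answers keeps parsing failures separate from genuine misclassifications.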