OmniVox: Zero-Shot Emotion Recognition with Omni-LLMs
John Murzaku, Owen Rambow
TL;DR
OmniVox investigates zero-shot emotion recognition using omni-LLMs that accept multiple modalities, focusing on audio-to-text emotion labeling. The approach prompts four omni-LLMs with acoustic-aware instructions that combine conversation context and step-by-step reasoning, and analyzes how context windows and acoustic descriptions affect emotion recognition in conversation (ERC) performance on IEMOCAP and MELD. Key findings show that acoustic prompting improves performance across models and that context helps on IEMOCAP, with some models approaching or surpassing fine-tuned baselines, though results vary by corpus and modality. The paper also provides a detailed error analysis linking misclassifications to mismatches in acoustic feature descriptions, highlighting both the promise and the limitations of zero-shot omni-LLMs for practical ERC tasks.
Abstract
The use of omni-LLMs (large language models that accept any modality as input), particularly for multimodal cognitive state tasks involving speech, is understudied. We present OmniVox, the first systematic evaluation of four omni-LLMs on the zero-shot emotion recognition task. We evaluate on two widely used multimodal emotion benchmarks, IEMOCAP and MELD, and find that zero-shot omni-LLMs outperform or are competitive with fine-tuned audio models. Alongside our audio-only evaluation, we also evaluate omni-LLMs on text-only input and on combined text and audio input. We present acoustic prompting, an audio-specific prompting strategy for omni-LLMs that focuses on acoustic feature analysis, conversation context analysis, and step-by-step reasoning. We compare acoustic prompting to minimal prompting and full chain-of-thought prompting. We perform a context window analysis on IEMOCAP and MELD and find that using context helps, especially on IEMOCAP. We conclude with an error analysis of the acoustic reasoning outputs generated by the omni-LLMs.
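To make the three ingredients of acoustic prompting concrete (acoustic feature analysis, conversation context analysis, and step-by-step reasoning, with a configurable context window), the following is a minimal illustrative sketch of how such an instruction might be assembled. It is not the prompt used in the paper: the wording, the `Turn` type, the `build_acoustic_prompt` function, the example label set, and the choice to represent context turns as text are all assumptions made for illustration, and the call that sends the prompt together with the audio clip to a model is omitted.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical label set for illustration; real inventories depend on the corpus.
LABELS = ["angry", "happy", "sad", "neutral"]


@dataclass
class Turn:
    """One prior conversation turn (assumed here to be available as text)."""
    speaker: str
    text: str


def build_acoustic_prompt(history: List[Turn], context_window: int,
                          labels: List[str] = LABELS) -> str:
    """Assemble a text instruction to accompany the target audio clip."""
    # Keep only the most recent `context_window` turns; 0 or less means no context.
    context = history[-context_window:] if context_window > 0 else []

    lines: List[str] = []
    if context:
        lines.append("Conversation context (previous turns):")
        lines.extend(f"{t.speaker}: {t.text}" for t in context)

    lines.append("Listen to the attached audio of the target utterance.")

    # Steps mirror the three ingredients named in the abstract.
    steps = ["Describe the acoustic features you hear "
             "(e.g., pitch, loudness, speaking rate)."]
    if context:
        steps.append("Consider how the conversation context bears on the utterance.")
    steps.append("Reason step by step, then answer with exactly one label from: "
                 + ", ".join(labels) + ".")
    lines.extend(f"Step {i}: {s}" for i, s in enumerate(steps, start=1))
    return "\n".join(lines)
```

Under these assumptions, varying `context_window` (for example 0, 2, 5) while holding the rest of the prompt fixed is one simple way to run the kind of context window comparison described above.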
