Plug-and-Play Emotion Graphs for Compositional Prompting in Zero-Shot Speech Emotion Recognition
Jiacheng Shi, Hongfei Du, Y. Alicia Hong, Ye Gao
TL;DR
The paper tackles zero-shot speech emotion recognition (SER) with large audio-language models (LALMs), which often underuse paralinguistic cues. It introduces Compositional Chain-of-Thought Prompting for Emotion Reasoning (CCoT-Emo), which represents acoustic features, textual sentiment, keywords, and cross-modal relations in a structured Emotion Graph and embeds that graph into the prompt, with no fine-tuning. The method has two stages: first generate the Emotion Graph from the audio and its transcript, then prompt the LALM with the graph to predict the emotion label. Across multiple SER benchmarks and models, CCoT-Emo consistently improves over direct prompting and standard CoT, and it makes the model's cross-modal reasoning more interpretable without task-specific supervision.
Abstract
Large audio-language models (LALMs) exhibit strong zero-shot performance across speech tasks but struggle with speech emotion recognition (SER) due to weak paralinguistic modeling and limited cross-modal reasoning. We propose Compositional Chain-of-Thought Prompting for Emotion Reasoning (CCoT-Emo), a framework that introduces structured Emotion Graphs (EGs) to guide LALMs in emotion inference without fine-tuning. Each EG encodes seven acoustic features (e.g., pitch, speech rate, jitter, shimmer), textual sentiment, keywords, and cross-modal associations. Embedded into prompts, EGs provide interpretable and compositional representations that enhance LALM reasoning. Experiments across SER benchmarks show that CCoT-Emo outperforms prior state-of-the-art (SOTA) methods and improves accuracy over zero-shot baselines.
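To make the idea of embedding an Emotion Graph into a prompt concrete, the following is a minimal sketch, not the authors' implementation. The schema (field names `acoustic`, `sentiment`, `keywords`, `relations`), the JSON serialization, the prompt wording, and the label set are all assumptions made for illustration. Only the four acoustic features named in the abstract are shown; the remaining features are not listed in this section and are left out rather than guessed.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class EmotionGraph:
    """Hypothetical container for an Emotion Graph (schema is an assumption)."""
    acoustic: dict[str, str]   # acoustic feature name -> coarse description
    sentiment: str             # textual sentiment of the transcript
    keywords: list[str]        # emotion-relevant words from the transcript
    # Cross-modal associations, stored here as simple edge dictionaries.
    relations: list[dict[str, str]] = field(default_factory=list)


def build_prompt(graph: EmotionGraph, transcript: str, labels: list[str]) -> str:
    """Serialize the graph to JSON and place it in a text prompt for an LALM."""
    graph_json = json.dumps(asdict(graph), indent=2)
    return (
        f"Transcript: {transcript}\n"
        f"Emotion Graph:\n{graph_json}\n"
        "Using the audio and the Emotion Graph above, choose one emotion "
        f"from: {', '.join(labels)}.\n"
        "Answer with the label only."
    )


# Stage 1 would produce a graph like this from the audio and transcript;
# the values below are made up for the example.
graph = EmotionGraph(
    acoustic={
        "pitch": "high",
        "speech_rate": "fast",
        "jitter": "elevated",
        "shimmer": "elevated",
    },
    sentiment="negative",
    keywords=["never", "again"],
    relations=[{"source": "pitch", "relation": "reinforces", "target": "sentiment"}],
)

# Stage 2: the prompt text that would accompany the audio input to the LALM.
prompt = build_prompt(
    graph,
    "I am never doing that again.",
    ["angry", "happy", "neutral", "sad"],
)
print(prompt)
```

In this sketch the graph is plain data, so it can be inspected or edited before the second stage, which is one way to read the interpretability claim. How the graph is actually generated and formatted in CCoT-Emo is not specified in this section.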
