ADEPT: RL-Aligned Agentic Decoding of Emotion via Evidence Probing Tools -- From Consensus Learning to Ambiguity-Driven Emotion Reasoning

Esther Sun, Bo-Hao Su, Abinay Reddy Naini, Shinji Watanabe, Carlos Busso

TL;DR

ADEPT redefines speech emotion recognition as an ambiguity-aware reasoning task, replacing one-shot predictions with a multi-turn agent that actively probes semantic and acoustic evidence. By integrating Explicit Information Retrieval and a GRPO-based training regime with an Evidence Trust Gate, ADEPT achieves auditable, evidence-grounded predictions and better recovery of co-occurring minor emotions on MSP-Podcast. The framework preserves minority annotations as informative supervision and demonstrates robust zero-shot generalization to IEMOCAP, indicating resilience to domain shift. Collectively, ADEPT advances interpretable, auditable affective computing by coupling structured tool-based reasoning with principled optimization to mitigate confirmation bias and reward-hacking concerns.

Abstract

Speech Large Language Models (SLLMs) enable high-level emotion reasoning but often produce ungrounded, text-biased judgments without verifiable acoustic evidence. In contrast, self-supervised speech encoders such as WavLM provide strong acoustic representations yet remain opaque discriminative models with limited interpretability. To bridge this gap, we introduce ADEPT (Agentic Decoding of Emotion via Evidence Probing Tools), a framework that reframes emotion recognition as a multi-turn inquiry process rather than a single-pass prediction. ADEPT transforms an SLLM into an agent that maintains an evolving candidate emotion set and adaptively invokes dedicated semantic and acoustic probing tools within a structured pipeline of candidate generation, evidence collection, and adjudication. Crucially, ADEPT enables a paradigm shift from consensus learning to ambiguity-driven emotion reasoning. Since human affect exhibits inherent complexity and frequent co-occurrence of emotions, we treat minority annotations as informative perceptual signals rather than discarding them as noise. Finally, we integrate Group Relative Policy Optimization (GRPO) with an Evidence Trust Gate to explicitly couple tool-usage behaviors with prediction quality and enforce evidence-grounded reasoning. Experiments show that ADEPT improves primary emotion accuracy in most settings while substantially improving minor emotion characterization, producing explanations grounded in auditable acoustic and semantic evidence.
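The coupling of tool usage to prediction quality described above can be illustrated with a minimal sketch. The group-relative normalization (each rollout's reward minus the group mean, divided by the group standard deviation) is the standard GRPO advantage. The `gated_reward` function, its `tool_bonus` and `evidence_supported` arguments, and the example reward values are hypothetical stand-ins, since the abstract does not specify how the Evidence Trust Gate is defined.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its own sampling group (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


def gated_reward(prediction_reward, tool_bonus, evidence_supported):
    """Hypothetical gate (an assumption, not the paper's definition):
    the tool-usage bonus only counts when the retrieved evidence
    actually supports the final answer."""
    return prediction_reward + (tool_bonus if evidence_supported else 0.0)


# Four rollouts sampled for the same utterance:
# (prediction_reward, tool_bonus, evidence_supported), all values illustrative.
rollouts = [
    (1.0, 0.2, True),   # correct, tools used, evidence supports the answer
    (1.0, 0.2, False),  # correct, tools used, but evidence does not support it
    (0.0, 0.2, False),  # wrong, tools used without supporting evidence
    (0.0, 0.0, False),  # wrong, no tools
]
rewards = [gated_reward(*r) for r in rollouts]
advantages = group_relative_advantages(rewards)
```

Under this assumed gate, a rollout that calls tools without obtaining supporting evidence earns no more than one that skips tools entirely, which is one simple way to discourage reward hacking through gratuitous tool calls.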

Paper Structure (75 sections, 19 equations, 12 figures, 9 tables)

Figures (12)

  • Figure 1: Illustration of our flexible label construction strategy for the annotations in the MSP-Podcast corpus, preserving both tied primary emotions and minority-vote minor emotions.
  • Figure 2: ADEPT three-phase inference workflow with tool-mediated evidence probing. Phase 1 (Global Perception) performs high-recall hypothesis initialization from audio and transcript, producing a ranked candidate set, tie prediction, and coarse reasoning. Phase 2 (Evidence Verification) executes adaptive evidence accumulation under a soft budget, combining four complementary tool families: (i) STRUCTURAL PRIOR for budget allocation and fallback candidate scheduling; (ii) SEMANTIC PROBING for span-level verification and pairwise disambiguation of confusable emotions; (iii) ACOUSTIC PROBING for emotion-neutral, localized signal measurements via a coarse-to-fine routine (locate$\rightarrow$analyze$\rightarrow$compare); and (iv) REFINEMENT for closed-loop re-checking (e.g., replay with updated focus points) when evidence is conflicting or insufficient. Phase 3 (Decision Synthesis) performs evidence-closed adjudication, prohibiting further tool calls and synthesizing only Phase-2 retrieved observations to produce standardized outputs (primary emotions, minor emotions, and overall evidence-grounded reasoning).
  • Figure 3: ADEPT three-phase inference protocol (illustrated example). Phase 1 initializes a high-recall candidate set and preserves uncertainty (e.g., a predicted tie). Phase 2 collects auditable semantic spans and localized acoustic evidence via tool calls to support or reject hypotheses. Phase 3 performs evidence-closed adjudication, overriding ties and outputting the final primary/minor emotions with an explicit audit trail.
  • Figure 4: Tool usage scales with annotation ambiguity. (a) Average tool calls increase from 2.3 (high consensus) to 6.8 (low consensus). (b) Low-consensus samples show elevated usage across all tools, especially replay_audio and StructuralPrior(expand).
  • Figure 5: Dataset overview of the MSP-Podcast corpus (v2.0) and annotation ambiguity. We summarize (a) the annotator consensus level distribution (high: 1 label; medium: 2--3 labels; low: 4+ labels), (b) the primary emotion distribution, (c) the tie rate conditioned on the primary emotion, and (d) the distribution of minor emotion counts for both Train and Test1 splits. These statistics highlight the prevalence of disagreement and multi-label ambiguity in naturalistic emotional speech, motivating ambiguity-driven emotion reasoning.
  • ...and 7 more figures
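
The label construction summarized in the Figure 1 caption and the consensus buckets in the Figure 5 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: it assumes one categorical vote per annotator, treats every emotion tied for the most votes as primary, treats any other voted emotion as minor, and counts distinct labels to assign the consensus level. The function names and the example votes are hypothetical.

```python
from collections import Counter


def build_labels(votes):
    """Split annotator votes into primary (tied top) and minor (minority) emotions."""
    if not votes:
        raise ValueError("at least one annotator vote is required")
    counts = Counter(votes)
    top = max(counts.values())
    # Keep every emotion tied for the most votes instead of breaking the tie.
    primary = sorted(e for e, c in counts.items() if c == top)
    # Keep minority votes as minor emotions instead of discarding them.
    minor = sorted(e for e, c in counts.items() if c < top)
    return primary, minor


def consensus_level(votes):
    """Bucket by number of distinct labels: 1 = high, 2-3 = medium, 4+ = low."""
    distinct = len(set(votes))
    if distinct == 1:
        return "high"
    if distinct <= 3:
        return "medium"
    return "low"


# Illustrative votes from five hypothetical annotators.
votes = ["happy", "happy", "neutral", "surprise", "neutral"]
primary, minor = build_labels(votes)
level = consensus_level(votes)
```

Here the two emotions with two votes each are both kept as primary labels, the single minority vote is kept as a minor emotion, and three distinct labels place the sample in the medium-consensus bucket.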