SEER: The Span-based Emotion Evidence Retrieval Benchmark
Aneesha Sampath, Oya Aran, Emily Mower Provost
TL;DR
SEER is a span-based emotion evidence retrieval benchmark that assesses LLMs' ability to locate the exact phrases expressing emotion in real-world text. It features two tasks (single-sentence inputs and five-sentence passages) and two prompt formats (Retrieve and Highlight). Fourteen open-source LLMs are evaluated on 1200 sentences annotated with ground-truth emotion categories, valence, and evidence spans. The evaluation uses token-level F1 and cosine similarity, aligning predicted and reference spans with the Kuhn-Munkres algorithm and penalizing incorrect span counts so that scores reward precise grounding. Several models approach average human performance on single sentences, but multi-sentence contexts substantially degrade accuracy. Common error modes include keyword fixation and false positives on neutral text; chain-of-thought (CoT) prompting helps some larger models but not uniformly. SEER provides a dataset and evaluation protocol for grounding emotion in text, with potential extensions to audio and to broader context-aware emotion grounding for empathetic and clinical applications.
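To make the span-alignment scoring concrete, here is a minimal sketch of token-level F1 computed over an optimal one-to-one matching of predicted and reference spans. It is an illustration under stated assumptions, not the benchmark's actual implementation: the function names `token_f1` and `aligned_f1` are hypothetical, tokenization is plain lowercase whitespace splitting, and the span-count penalty is modeled by dividing by the larger of the two span counts so that unmatched spans score zero. For brevity the optimal matching is found by brute force over permutations, which gives the same optimum as the Kuhn-Munkres algorithm for the handful of spans in a sentence; a real implementation would more likely use a dedicated solver such as `scipy.optimize.linear_sum_assignment`.

```python
from collections import Counter
from itertools import permutations


def token_f1(pred: str, gold: str) -> float:
    """Token-level F1 between two spans (assumed whitespace tokenization)."""
    p, g = pred.lower().split(), gold.lower().split()
    overlap = sum((Counter(p) & Counter(g)).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)


def aligned_f1(pred_spans: list[str], gold_spans: list[str]) -> float:
    """Mean F1 over the best one-to-one span matching.

    Assumption: dividing by max(#pred, #gold) makes every unmatched
    span contribute 0, which penalizes an incorrect span count.
    """
    if not pred_spans and not gold_spans:
        return 1.0  # nothing to find, nothing predicted
    if not pred_spans or not gold_spans:
        return 0.0
    n_pred, n_gold = len(pred_spans), len(gold_spans)
    scores = [[token_f1(p, g) for g in gold_spans] for p in pred_spans]
    if n_pred <= n_gold:
        # Assign each predicted span to a distinct gold span.
        best = max(
            sum(scores[i][j] for i, j in enumerate(perm))
            for perm in permutations(range(n_gold), n_pred)
        )
    else:
        # Assign each gold span to a distinct predicted span.
        best = max(
            sum(scores[i][j] for j, i in enumerate(perm))
            for perm in permutations(range(n_pred), n_gold)
        )
    return best / max(n_pred, n_gold)


if __name__ == "__main__":
    pred = ["so happy", "ruined my day"]
    gold = ["ruined my whole day", "so happy"]
    print(round(aligned_f1(pred, gold), 4))
```

Under this normalization, a prediction that contains the correct span plus one spurious span scores half of what the correct span alone would score, which is one simple way to discourage over-extraction.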
Abstract
We introduce the SEER (Span-based Emotion Evidence Retrieval) Benchmark to test Large Language Models' (LLMs) ability to identify the specific spans of text that express emotion. Unlike traditional emotion recognition tasks that assign a single label to an entire sentence, SEER targets the underexplored task of emotion evidence detection: pinpointing which exact phrases convey emotion. This span-level approach is crucial for applications like empathetic dialogue and clinical support, which need to know how emotion is expressed, not just what the emotion is. SEER includes two tasks: identifying emotion evidence within a single sentence, and identifying evidence across a short passage of five consecutive sentences. It contains new annotations for both emotion and emotion evidence on 1200 real-world sentences. We evaluate 14 open-source LLMs and find that, while some models approach average human performance on single-sentence inputs, their accuracy degrades in longer passages. Our error analysis reveals key failure modes, including overreliance on emotion keywords and false positives in neutral text.
