SEER: The Span-based Emotion Evidence Retrieval Benchmark

Aneesha Sampath, Oya Aran, Emily Mower Provost

TL;DR

SEER is a span-based emotion evidence retrieval benchmark that assesses LLMs' ability to locate the exact phrases that express emotion in real-world text. It features two tasks (single sentences and five-sentence passages) and two prompt formats (Retrieve and Highlight). It evaluates 14 open-source LLMs on 1200 sentences annotated with ground-truth emotion categories, valence, and evidence spans. Evaluation uses token-level F1 and cosine similarity, aligns predicted spans to reference spans with the Kuhn-Munkres algorithm, and penalizes incorrect span counts to ensure precise grounding. Key findings: several models approach human performance on single sentences, but multi-sentence contexts substantially degrade accuracy. Common error modes include keyword fixation and false positives on neutral text. Chain-of-thought (CoT) prompting helps some larger models, but not uniformly. SEER provides a dataset and evaluation protocol for grounding emotion in text, with potential extensions to audio and to broader context-aware emotion grounding for empathetic and clinical applications.
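To make the span-level scoring concrete, here is a minimal sketch of token-level F1 with an optimal one-to-one span alignment. It is not the authors' implementation. The function names (`token_f1`, `aligned_f1`), whitespace tokenization, the score of 1.0 when both span lists are empty, and the count penalty (dividing by the larger span count so that unmatched spans score zero) are all assumptions made for illustration. For brevity, the alignment is found by brute force over permutations rather than with the Kuhn-Munkres algorithm; both return the same optimal matching, but brute force is only practical for a handful of spans.

```python
from collections import Counter
from itertools import permutations


def token_f1(pred: str, gold: str) -> float:
    """Token-level F1 between two spans (assumes lowercase whitespace tokens)."""
    p, g = pred.lower().split(), gold.lower().split()
    if not p or not g:
        return 0.0
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)


def aligned_f1(pred_spans: list[str], gold_spans: list[str]) -> float:
    """Best one-to-one alignment score, penalizing span-count mismatch.

    Assumption: unmatched spans on either side contribute zero, so the
    summed F1 is divided by the larger of the two span counts.
    """
    if not pred_spans and not gold_spans:
        return 1.0  # assumed convention: correctly predicting "no evidence"
    if not pred_spans or not gold_spans:
        return 0.0
    # token_f1 is symmetric, so the shorter list can be matched into the longer.
    short, long_ = sorted((pred_spans, gold_spans), key=len)
    best = max(
        sum(token_f1(s, long_[j]) for s, j in zip(short, perm))
        for perm in permutations(range(len(long_)), len(short))
    )
    return best / len(long_)


if __name__ == "__main__":
    gold = ["so happy"]
    print(aligned_f1(["so happy"], gold))           # exact match
    print(aligned_f1(["so happy", "today"], gold))  # extra span is penalized
```

In this sketch, a prediction that includes the correct span plus one spurious span scores half of an exact match, which is one simple way to discourage over-extraction.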

Abstract

We introduce the SEER (Span-based Emotion Evidence Retrieval) Benchmark to test Large Language Models' (LLMs) ability to identify the specific spans of text that express emotion. Unlike traditional emotion recognition tasks that assign a single label to an entire sentence, SEER targets the underexplored task of emotion evidence detection: pinpointing which exact phrases convey emotion. This span-level approach is crucial for applications like empathetic dialogue and clinical support, which need to know how emotion is expressed, not just what the emotion is. SEER includes two tasks: identifying emotion evidence within a single sentence, and identifying evidence across a short passage of five consecutive sentences. It contains new annotations for both emotion and emotion evidence on 1200 real-world sentences. We evaluate 14 open-source LLMs and find that, while some models approach average human performance on single-sentence inputs, their accuracy degrades in longer passages. Our error analysis reveals key failure modes, including overreliance on emotion keywords and false positives in neutral text.

Paper Structure

This paper contains 48 sections, 1 equation, 6 figures, 13 tables.

Figures (6)

  • Figure 1: SEER includes two tasks: single- and multi-sentence emotion evidence identification. Each has two prompt formats: Retrieve (extract exact spans) and Highlight (mark spans in context). Task objectives are identical across formats. The text is truncated in the figure for space, but not in actual LLM input/output.
  • Figure 2: Emotion transitions between adjacent sentences for Task 2.
  • Figure 3: Task 1 (Retrieve-Base). (a) Per-emotion F1 scores. (b) Per-valence F1 scores.
  • Figure 4: Task 1 (Highlight-Base). (a) Per-emotion F1 scores. (b) Per-valence F1 scores.
  • Figure 5: Emotion category errors in Task 2 Retrieve-Base.
  • ...and 1 more figure