CEI: A Benchmark for Evaluating Pragmatic Reasoning in Language Models

Jon Chun, Hannah Sussman, Adrian Mangine, Murathan Kocaman, Kirill Sidorko, Abhigya Koirala, Andre McCloud, Gwen Eisenbeis, Wisdom Akanwe, Moustapha Gassama, Eliezer Gonzalez Chirinos, Anne-Duncan Enright, Peter Dunson, Tiffanie Ng, Anna von Rosenstiel, Godwin Idowu

TL;DR

We present the Contextual Emotional Inference (CEI) Benchmark: 300 human-validated scenarios for evaluating how well LLMs disambiguate pragmatically complex utterances. We also describe the annotation methodology, including a 4-level quality control pipeline that combines automated statistical checks with expert adjudication.

Abstract

Pragmatic reasoning, inferring intended meaning beyond literal semantics, underpins everyday communication yet remains difficult for large language models. We present the Contextual Emotional Inference (CEI) Benchmark: 300 human-validated scenarios for evaluating how well LLMs disambiguate pragmatically complex utterances. Each scenario pairs a situational context and speaker-listener roles (with explicit power relations) against an ambiguous utterance. The dataset covers five pragmatic subtypes (sarcasm/irony, mixed signals, strategic politeness, passive aggression, deflection/misdirection) drawn from workplace, family, social, and service settings, with three power configurations (peer, higher-to-lower, lower-to-higher). Three trained annotators independently labeled every scenario. Inter-annotator agreement (Fleiss' kappa = 0.06-0.25 by subtype) is low but expected: pragmatic inference admits multiple valid readings, and the disagreement itself is informative. We describe our annotation methodology, including a 4-level quality control pipeline that combines automated statistical checks with expert adjudication. CEI is released under CC-BY-4.0.
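
For readers unfamiliar with the agreement statistic cited above, the following is a minimal sketch of the standard Fleiss' kappa computation for a fixed number of raters per item. It is not the authors' code: the function name `fleiss_kappa` and the toy labels are assumptions made for illustration, and the toy data is not drawn from CEI.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each a list of categorical labels.

    Assumes every item was labeled by the same number of raters (>= 2).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])

    # Per-item label counts: how many raters chose each category.
    counts = [Counter(item) for item in ratings]

    # Observed agreement per item: fraction of rater pairs that agree.
    per_item = [
        (sum(c * c for c in cnt.values()) - n_raters)
        / (n_raters * (n_raters - 1))
        for cnt in counts
    ]
    p_bar = sum(per_item) / n_items

    # Expected agreement by chance, from overall category proportions.
    totals = Counter()
    for cnt in counts:
        totals.update(cnt)
    p_e = sum((n / (n_items * n_raters)) ** 2 for n in totals.values())

    if p_e == 1.0:
        # Only one category was ever used; kappa is undefined.
        return float("nan")
    return (p_bar - p_e) / (1 - p_e)


# Hypothetical toy labels: 4 items, 3 raters each (not CEI data).
toy_ratings = [
    ["anger", "anger", "anger"],
    ["joy", "joy", "sadness"],
    ["fear", "anger", "joy"],
    ["joy", "joy", "joy"],
]
print(round(fleiss_kappa(toy_ratings), 3))  # prints 0.333
```

Because kappa corrects observed agreement for chance agreement, even a dataset where two of three annotators often agree can yield low values when a few emotion categories dominate, which is consistent with the abstract's point that low kappa alone does not imply careless annotation.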

Paper Structure (84 sections, 10 figures, 10 tables)

Figures (10)

  • Figure 1: The CEI annotation task. A Speaker (S) directs an utterance to a Listener (L) within a shared social context with an explicit power relation. An external Annotator (A) observes the exchange and infers the speaker's primary emotion (from Plutchik's 8 categories), VAD ratings, and their own confidence.
  • Figure 2: Label Studio annotation interface. Annotators viewed the situational context, speaker/listener roles, and utterance, then selected a primary emotion from Plutchik's 8 categories and provided Valence-Arousal-Dominance ratings on 7-point scales.
  • Figure 3: Quality control pipeline. Annotations pass through four stages: schema checks, statistical outlier detection, inter-annotator agreement analysis, and expert adjudication of flagged items.
  • Figure 4: Scenario difficulty distribution based on annotator agreement. (a) Overall: 14% unanimous, 54% majority, 31% split. (b) By subtype: sarcasm/irony shows highest agreement; deflection shows lowest.
  • Figure 5: Human annotator confusion matrix for Plutchik's 8 emotions. Values show row-normalized proportions (how often Annotator 1's choice co-occurs with Annotator 2's). Diagonal represents agreement. Anger-surprise and sadness-surprise show highest confusion.
  • ...and 5 more figures
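
The difficulty buckets named in the Figure 4 caption (unanimous, majority, split) can be sketched as a simple function of the three annotators' primary-emotion labels. This is an assumed reading of the caption, not the authors' published procedure: the helper names `agreement_level` and `agreement_distribution` are hypothetical, and the example labels are invented.

```python
from collections import Counter


def agreement_level(labels):
    """Classify one scenario by how many annotators chose the top label.

    Assumed definitions: 'unanimous' if all labels match, 'majority' if
    more than half match, 'split' otherwise.
    """
    top_count = Counter(labels).most_common(1)[0][1]
    if top_count == len(labels):
        return "unanimous"
    if top_count > len(labels) / 2:
        return "majority"
    return "split"


def agreement_distribution(all_labels):
    """Return the share of scenarios falling in each agreement bucket."""
    levels = Counter(agreement_level(labels) for labels in all_labels)
    total = sum(levels.values())
    return {level: count / total for level, count in levels.items()}


# Hypothetical labels for three scenarios, three annotators each.
examples = [
    ["anger", "anger", "anger"],      # unanimous
    ["anger", "surprise", "anger"],   # majority
    ["sadness", "surprise", "fear"],  # split
]
for labels in examples:
    print(labels, "->", agreement_level(labels))
```

With three annotators the three buckets are exhaustive: the most frequent label is chosen by three, two, or one annotator.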