SocialNLI: A Dialogue-Centric Social Inference Dataset

Akhil Deo, Kate Sanders, Benjamin Van Durme

TL;DR

SocialNLI (SoNLI) is a transcript-centric social inference dataset for probing theory-of-mind (ToM) capabilities in AI, with an emphasis on sarcasm and irony in multi-speaker dialogues. It pairs open-ended inferences generated for each dialogue with human-provided plausibility judgments and explanations, and adds a counterfactual reasoning framework for evaluating the ToM alignment of large language models and reasoning models. Experiments show that state-of-the-art models align only weakly with human social inferences: GPT-4o is the only model with a meaningful correlation, and all models fall short of humans on explanation quality. SoNLI serves as a diagnostic resource for developing socially aware AI and points to the need for stronger ToM reasoning, broader datasets, and better explanations in dialogue understanding. The authors also acknowledge limitations in the dataset's scope and potential bias as directions for future research.

Abstract

Making theory-of-mind inferences from human dialogue is a strong indicator of a model's underlying social abilities, which are fundamental for adept AI assistants. However, large language and reasoning models struggle to understand sophisticated social phenomena in transcript data, such as sarcasm and irony. To assess the weaknesses of current models and to identify ways to address them, we introduce SocialNLI (SoNLI) -- the first social dialogue inference dataset. SoNLI consists of a collection of dialogue transcripts hand-picked to center complex social nuances like irony and sarcasm, paired with inferences, corresponding likelihood scores, and human-written explanations. We explore social inference analysis as a facet of theory-of-mind, and evaluate the theory-of-mind ability of LLMs and reasoning models through multi-step counterfactual reasoning.

Paper Structure

This paper contains 39 sections, 2 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: (A) LLMs fall behind humans on counterfactual reasoning over complex dialogue snippets. (B) SocialNLI dataset contents.
  • Figure 2: Human-judged factuality of explanations. For each inference, models produce supporting and opposing explanations, and humans score the accuracy of those explanations. LLMs are blue; LRMs are orange. The red horizontal line shows human baseline performance (92.59%), exceeding all model accuracies.
  • Figure 3: Inference type distribution.