SocialMaze: A Benchmark for Evaluating Social Reasoning in Large Language Models

Zixiang Xu, Yanbo Wang, Yue Huang, Jiayi Ye, Haomin Zhuang, Zirui Song, Lang Gao, Chenxi Wang, Zhaorun Chen, Yujun Zhou, Sixian Li, Wang Pan, Yue Zhao, Jieyu Zhao, Xiangliang Zhang, Xiuying Chen

TL;DR

SocialMaze presents a time-aware, graph-based benchmark to systematically evaluate social reasoning in LLMs across deep reasoning, dynamic interaction, and information uncertainty. It defines six tasks across social games, daily-life interactions, and digital platforms, each constructed from layered graphs and validated with automated and human checks. Experiments show that long chain-of-thought reasoning enhances performance on deep inference tasks, dynamic interaction effects vary by task, and information uncertainty significantly challenges models; targeted fine-tuning on curated reasoning traces yields substantial improvements. The benchmark and findings offer a structured pathway to advance LLMs toward more robust, context-aware social reasoning in real-world scenarios.

Abstract

Large language models (LLMs) are increasingly applied to socially grounded tasks, such as online community moderation, media content analysis, and social reasoning games. Success in these contexts depends on a model's social reasoning ability - the capacity to interpret social contexts, infer others' mental states, and assess the truthfulness of presented information. However, there is currently no systematic evaluation framework that comprehensively assesses the social reasoning capabilities of LLMs. Existing efforts often oversimplify real-world scenarios and consist of tasks that are too basic to challenge advanced models. To address this gap, we introduce SocialMaze, a new benchmark specifically designed to evaluate social reasoning. SocialMaze systematically incorporates three core challenges: deep reasoning, dynamic interaction, and information uncertainty. It provides six diverse tasks across three key settings: social reasoning games, daily-life interactions, and digital community platforms. Both automated and human validation are used to ensure data quality. Our evaluation reveals several key insights: models vary substantially in their ability to handle dynamic interactions and integrate temporally evolving information; models with strong chain-of-thought reasoning perform better on tasks requiring deeper inference beyond surface-level cues; and model reasoning degrades significantly under uncertainty. Furthermore, we show that targeted fine-tuning on curated reasoning examples can greatly improve model performance in complex social scenarios. The dataset is publicly available at: https://huggingface.co/datasets/MBZUAI/SocialMaze

Paper Structure

This paper contains 31 sections, 53 figures, 8 tables, 2 algorithms.

Figures (53)

  • Figure 1: Overview of the SocialMaze Benchmark. All tasks are built upon (a) Layered Social Interaction Graphs, a time-aware modeling framework for social networks. Based on this template, we instantiate (b) 6 task types, covering social reasoning games, daily life interactions, and digital community platforms. (c) illustrates one specific example of Hidden Role Deduction, including a description of the graphs along with both vertex-centric and graph-level queries.
  • Figure 2: Model performance in Hidden Role Deduction across four task variants with increasing information uncertainty. Accuracy is shown after 3 rounds.
  • Figure 3: Performance comparison of selected LLMs on SocialMaze tasks, highlighting different model strengths.
  • Figure 4: Performance comparison of Long CoT and Short CoT models. The line plot shows average accuracy; the bar plot shows the output length ratio (Long CoT / Short CoT). Orange bars indicate tasks with high deep reasoning demand, purple bars indicate low deep reasoning demand.
  • Figure 5: Performance in the Full task of Hidden Role Deduction, by model-assigned role. Models show reduced accuracy—especially in self-role identification—when assigned roles involving distorted self-perception (Rumormonger, Lunatic).
  • ...and 48 more figures
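
The Figure 1 caption describes tasks built on layered, time-aware interaction graphs with vertex-centric and graph-level queries. The sketch below is one possible way to picture that structure, not the benchmark's actual data format: the class names, method names, example utterances, and the "Other" role label are all hypothetical. Only the Rumormonger and Lunatic role names come from the figure captions.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Interaction:
    """One directed edge: `source` addresses `target` with `content`."""
    source: str
    target: str
    content: str


@dataclass
class LayeredSocialGraph:
    """Hypothetical container: a fixed vertex set plus one edge layer per round."""
    roles: dict[str, str]  # vertex name -> hidden label (ground truth)
    layers: defaultdict[int, list[Interaction]] = field(
        default_factory=lambda: defaultdict(list)
    )

    def add_interaction(self, round_idx: int, source: str, target: str, content: str) -> None:
        # Reject edges that mention a vertex outside the declared vertex set.
        for name in (source, target):
            if name not in self.roles:
                raise KeyError(f"unknown vertex: {name}")
        self.layers[round_idx].append(Interaction(source, target, content))

    def visible_up_to(self, round_idx: int) -> list[Interaction]:
        """Edges from all layers up to and including `round_idx`, in round order."""
        return [e for r in sorted(self.layers) if r <= round_idx for e in self.layers[r]]

    def vertex_query(self, name: str) -> str:
        """Ground truth for a vertex-centric question ("what is X's role?")."""
        return self.roles[name]

    def graph_query(self, role: str) -> list[str]:
        """Ground truth for a graph-level question ("who holds this role?")."""
        return sorted(n for n, r in self.roles.items() if r == role)


# Toy instance with three rounds; "Other" is a placeholder label, not a benchmark role.
g = LayeredSocialGraph(roles={"Alice": "Lunatic", "Bob": "Rumormonger", "Cara": "Other"})
g.add_interaction(1, "Alice", "Bob", "I think Bob is lying.")
g.add_interaction(2, "Bob", "Cara", "Cara can be trusted.")
g.add_interaction(3, "Cara", "Alice", "Alice contradicts herself.")
```

Keeping each round as its own layer means a prompt for "after round k" can be assembled from `visible_up_to(k)` without leaking later statements, while the hidden labels stay separate from what is shown to the model.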