Evaluating and Modeling Social Intelligence: A Comparative Study of Human and AI Capabilities
Junqi Wang, Chunhui Zhang, Jiapeng Li, Yuxi Ma, Lixing Niu, Jiaheng Han, Yujia Peng, Yixin Zhu, Lifeng Fan
TL;DR
This work introduces a benchmark for evaluating social intelligence by formalizing a two-agent framework with forward planning, inverse reasoning, and inverse inverse planning. It implements a recursive Bayesian model that unifies two grid-world tasks, Inverse Reasoning (IR) and Inverse Inverse Planning (IIP), and evaluates both humans and LLMs (GPT-3.5/4) in zero-shot and one-shot settings across text and multimodal inputs. Empirical results show that humans consistently outperform the LLMs, which operate at the most basic order of social intelligence (order $0$) and rely more on pattern shortcuts than on deep theory-of-mind understanding. The study demonstrates the benchmark’s diagnostic power and publicly releases code, data, and human measurements to advance authentic social cognition in AI.
Abstract
Amid the ongoing debate on whether Large Language Models (LLMs) attain near-human intelligence levels (Mitchell & Krakauer, 2023; Bubeck et al., 2023; Kosinski, 2023; Shiffrin & Mitchell, 2023; Ullman, 2023), this study introduces a benchmark for evaluating social intelligence, one of the most distinctive aspects of human cognition. We developed a comprehensive theoretical framework for social dynamics and introduced two evaluation tasks: Inverse Reasoning (IR) and Inverse Inverse Planning (IIP). Our approach also included a computational model based on recursive Bayesian inference, which accounts for diverse human behavioral patterns. Extensive experiments and detailed analyses revealed that humans surpassed the latest GPT models in overall performance, zero-shot learning, one-shot generalization, and adaptability to multiple modalities. Notably, GPT models demonstrated social intelligence only at the most basic order (order = 0), in stark contrast to human social intelligence (order >= 2). Further examination indicated that LLMs tend to rely on pattern recognition as a shortcut, casting doubt on whether they possess authentic human-level social intelligence. Our code, dataset, appendix, and human data are released at https://github.com/bigai-ai/Evaluate-n-Model-Social-Intelligence.
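To make the idea of recursive Bayesian inference concrete, the following is a minimal toy sketch, not the model released with the paper. It assumes two hypothetical goals, three hypothetical actions, and made-up likelihood values; the function names and the way recursion depth is counted are illustrative and need not match the paper's definition of order. An observer inverts an actor's policy with Bayes' rule, and a more sophisticated actor then chooses actions according to how clearly they reveal its goal to that observer.

```python
# Hypothetical toy setup: goals, actions, and probabilities are invented
# for illustration and are not taken from the paper or its code release.
GOALS = ["A", "B"]
ACTIONS = ["left", "up", "right"]

# Base actor: P(action | goal) from plain forward planning (assumed values).
BASE_POLICY = {
    "A": {"left": 0.5, "up": 0.4, "right": 0.1},
    "B": {"left": 0.1, "up": 0.4, "right": 0.5},
}
PRIOR = {"A": 0.5, "B": 0.5}  # uniform prior over goals (assumption)


def normalize(weights):
    """Scale non-negative weights so they sum to one."""
    total = sum(weights.values())
    return {key: value / total for key, value in weights.items()}


def observer(policy):
    """Inverse step: P(goal | action) is proportional to P(action | goal) * P(goal)."""
    return {
        action: normalize({g: policy[g][action] * PRIOR[g] for g in GOALS})
        for action in ACTIONS
    }


def actor(posterior):
    """Forward step: favor actions that lead the observer to the true goal."""
    return {
        goal: normalize({a: posterior[a][goal] for a in ACTIONS})
        for goal in GOALS
    }


def recursive_policy(depth):
    """Alternate observer and actor updates `depth` times, starting from the base policy."""
    policy = BASE_POLICY
    for _ in range(depth):
        policy = actor(observer(policy))
    return policy


if __name__ == "__main__":
    print(observer(BASE_POLICY))   # goal posterior for each observed action
    print(recursive_policy(1))     # actor that anticipates the observer
```

In this toy setup, one round of recursion shifts the actor away from the ambiguous action "up" and toward the action that best signals its goal, which is the kind of behavior that separates higher-order reasoning from a base-level policy.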
