Evaluating and Modeling Social Intelligence: A Comparative Study of Human and AI Capabilities

Junqi Wang, Chunhui Zhang, Jiapeng Li, Yuxi Ma, Lixing Niu, Jiaheng Han, Yujia Peng, Yixin Zhu, Lifeng Fan

TL;DR

This work introduces a benchmark for evaluating social intelligence by formalizing a two-agent framework with forward planning, inverse reasoning, and inverse inverse planning. It implements a recursive Bayesian model that unifies two grid-world tasks, Inverse Reasoning (IR) and Inverse Inverse Planning (IIP), and evaluates both humans and LLMs (GPT-3.5/4) in zero-shot and one-shot settings across text and multimodal inputs. Empirical results show that humans consistently outperform LLMs, which operate at the most basic order of social intelligence (order $0$) and rely more on pattern shortcuts than on deep theory-of-mind understanding. The study demonstrates the benchmark’s diagnostic power and publicly releases code, data, and human measurements to advance authentic social cognition in AI.

Abstract

Facing the current debate on whether Large Language Models (LLMs) attain near-human intelligence levels (Mitchell & Krakauer, 2023; Bubeck et al., 2023; Kosinski, 2023; Shiffrin & Mitchell, 2023; Ullman, 2023), the current study introduces a benchmark for evaluating social intelligence, one of the most distinctive aspects of human cognition. We developed a comprehensive theoretical framework for social dynamics and introduced two evaluation tasks: Inverse Reasoning (IR) and Inverse Inverse Planning (IIP). Our approach also encompassed a computational model based on recursive Bayesian inference, adept at elucidating diverse human behavioral patterns. Extensive experiments and detailed analyses revealed that humans surpassed the latest GPT models in overall performance, zero-shot learning, one-shot generalization, and adaptability to multiple modalities. Notably, GPT models demonstrated social intelligence only at the most basic order (order = 0), in stark contrast to human social intelligence (order >= 2). Further examination indicated a propensity of LLMs to rely on pattern recognition for shortcuts, casting doubt on their possession of authentic human-level social intelligence. Our code, dataset, appendix, and human data are released at https://github.com/bigai-ai/Evaluate-n-Model-Social-Intelligence.

Paper Structure (39 sections, 1 equation, 15 figures, 8 tables, 7 algorithms)


Figures (15)

  • Figure 1: A unified framework of social dynamics. The foundational unit of human social interaction is exemplified by the actor $i$ and the observer $j$. This interaction is characterized by recursive mind reasoning, leading to the formation of a multi-layered cognitive architecture termed "N Minds" (Fan et al., 2021). This structure encompasses various levels of cognitive processing, including 0th-order minds, 1st-order minds, and 2nd-order minds. Our framework primarily concentrates on three critical mental operations: (i) Forward Planning, where actors strategize future actions based on current states; (ii) Inverse Reasoning, involving the observer's deduction of underlying actor motives from observed actions; and (iii) Inverse Inverse Planning, a higher-order cognitive process where the actor anticipates the observer's inferences and plans actions accordingly.
  • Figure 2: Evaluation tasks: IR (left) and IIP (right). The IR task involves observer Alice analyzing actor Bob's trajectory to deduce his preferred food truck. In the IIP task, actor Carol strategizes her route to efficiently convey her restaurant preference to observer David.
  • Figure 3: Input stimuli examples for both tasks. (a) Scene layout and actor's trajectory in the IR task; (b) Agent perception field in IR; (c) Scene layout for the IIP task; (d)-(g) Four potential routes for the actor in the IIP task scenario. During testing, routes are randomly shuffled to ensure unbiased assessment.
  • Figure 4: IR task types. (a) Intermediate: represented by $M>\{X,Y,Z,N\}$, indicating that $M$ is preferred over the others, $X$, $Y$, $Z$, and $N$; (b) Last: characterized by $Y>\{X, Z, M\}$, suggesting that $Y$ is chosen last among the visible options, leaving the preference for the absent $N$ uncertain; (c) Previsited: depicted as $N>Z>\{X,Y,M\}$, where the actor revisits and chooses $Z$ after seeing all options, implying a preference for $N$ over $Z$, and for $Z$ over $X$, $Y$, and $M$.
  • Figure 5: IIP task types with Hybrid routes. (a) Type I: A cyclic route that revisits a location and passes an alternative restaurant $Y$; (b) Type II: A cyclic route that does not pass through the vicinity of restaurant $Y$; (c) Type III: An acyclic route passing by the alternative restaurant $Y$; (d) Type IV: An acyclic route that avoids the vicinity of restaurant $Y$. Each type presents distinct problem patterns and difficulties for the actor to communicate their preference.
  • ...and 10 more figures
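
The three mental operations named in Figure 1 can be illustrated with a toy recursive Bayesian sketch. This is not the authors' model: the two goals, the two routes, the cost table `COSTS`, the uniform `PRIOR`, the softmax temperature `beta`, and all function names are hypothetical choices made only to show how each level can nest the one below it.

```python
import math

# Hypothetical toy world: route cost for each goal (assumed values, not from the paper).
COSTS = {
    "X": {"shared": 1.0, "distinct": 2.0},
    "Y": {"shared": 1.0, "distinct": 4.0},
}
PRIOR = {"X": 0.5, "Y": 0.5}  # assumed uniform prior over goals


def normalize(weights):
    """Scale non-negative weights so they sum to 1."""
    total = sum(weights.values())
    return {key: value / total for key, value in weights.items()}


def forward_planning(goal, beta=1.0):
    """Base actor: softmax over negative route cost, P(route | goal)."""
    return normalize({r: math.exp(-beta * c) for r, c in COSTS[goal].items()})


def inverse_reasoning(route, beta=1.0):
    """Observer: Bayes' rule over goals given one observed route, P(goal | route)."""
    return normalize(
        {g: PRIOR[g] * forward_planning(g, beta)[route] for g in PRIOR}
    )


def inverse_inverse_planning(goal, beta=1.0):
    """Observer-aware actor: reweight each route by how strongly the
    observer would infer the true goal from it."""
    base = forward_planning(goal, beta)
    return normalize(
        {r: base[r] * inverse_reasoning(r, beta)[goal] for r in base}
    )


if __name__ == "__main__":
    print("forward  :", forward_planning("X"))
    print("observer :", inverse_reasoning("distinct"))
    print("signaling:", inverse_inverse_planning("X"))
```

In this toy setting the cheap `shared` route is ambiguous between the two goals, so the observer-aware actor shifts probability toward the costlier but more informative `distinct` route relative to the base actor, which is the qualitative trade-off the IIP task is described as probing.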