InterIntent: Investigating Social Intelligence of LLMs via Intention Understanding in an Interactive Game Context

Ziyi Liu, Abhishek Anand, Pei Zhou, Jen-tse Huang, Jieyu Zhao

TL;DR

A novel framework, InterIntent, is developed to assess LLMs’ social intelligence by mapping their ability to understand and manage intentions in a game setting, focusing on four dimensions of social intelligence: situational awareness, self-regulation, self-awareness, and theory of mind.

Abstract

Large language models (LLMs) have demonstrated the potential to mimic human social intelligence. However, most studies focus on simplistic and static self-report or performance-based tests, which limits the depth and validity of the analysis. In this paper, we developed a novel framework, InterIntent, to assess LLMs' social intelligence by mapping their ability to understand and manage intentions in a game setting. We focus on four dimensions of social intelligence: situational awareness, self-regulation, self-awareness, and theory of mind. Each dimension is linked to a specific game task: intention selection, intention following, intention summarization, and intention guessing. Our findings indicate that while LLMs exhibit high proficiency in selecting intentions, achieving an accuracy of 88%, their ability to infer the intentions of others is significantly weaker, trailing human performance by 20%. Additionally, game performance correlates with intention understanding, highlighting the importance of the four components towards success in this game. These findings underline the crucial role of intention understanding in evaluating LLMs' social intelligence and highlight the potential of using social deduction games as a complex testbed to enhance LLM evaluation. InterIntent contributes a structured approach to bridging the evaluation gap in social intelligence within multiplayer games.

Paper Structure (41 sections, 23 figures, 11 tables)

Figures (23)

  • Figure 1: Four dimensions to assess social intelligence in Avalon. We provide a dynamic and complex gaming context for evaluations. For situational awareness, we provide both positive and negative examples. In the negative example, the intention is inappropriate because the previous quest was successful, so no player was in a failed quest. For self-regulation, we require models to provide explicit information rather than repeating the intentions. Intentions are in bold within the contexts.
  • Figure 2: The Avalon game process for one round. Left: the entire game pipeline. Right: the procedure for generating a single player's speech.
  • Figure 3: Self-regulation results. The results show the percentage of each score over all data samples. Scores 1-5 follow the evaluation criteria (Table \ref{tab:score_example}). A score of 5 is the best and a score of 1 is the worst.
  • Figure 4: Correlation between Intention Selection/Following and game performance. We present the percentages of games where evil players perform equally to, better than, or worse than loyal players. For example, in games won by loyal players in (a), their performance matches or exceeds that of evil players. We mark the performance differences between evil and loyal players in red, showing a greater gap in successful games/quests compared to failed ones.
  • Figure 5: ToM results over rounds. Human results are based on 200 data points from the GPT-3.5 games. Because games usually stop at round 4, results from round 5 are not included.
  • ...and 18 more figures