Leveraging Word Guessing Games to Assess the Intelligence of Large Language Models

Tian Liang; Zhiwei He; Jen-tse Huang; Wenxuan Wang; Wenxiang Jiao; Rui Wang; Yujiu Yang; Zhaopeng Tu; Shuming Shi; Xing Wang

TL;DR

The paper tackles the challenge of automatically evaluating LLM-based agents, addressing the limitations of costly human-annotated datasets. It introduces DEEP, a single-agent framework that probes descriptive accuracy and intentional disguise through aggressive and conservative word descriptions judged by GPT-4, and SpyGame, an interactive multi-agent framework that tests linguistic intelligence and theory of mind in a competitive word-guessing game. Extensive experiments compare open- and closed-source LLMs, reveal GPT-4’s superior performance, and analyze biases and robustness across seeds and configurations, supported by human validation. The work provides scalable, language- and domain-agnostic evaluation tools that can inform the development of more capable, adaptable LLM-based agents in diverse real-world settings.

Abstract

The automatic evaluation of LLM-based agent intelligence is critical in developing advanced LLM-based agents. Although considerable effort has been devoted to developing human-annotated evaluation datasets, such as AlpacaEval, existing techniques are costly, time-consuming, and lack adaptability. In this paper, inspired by the popular language game "Who is Spy", we propose to use the word guessing game to assess the intelligence of LLMs. Given a word, the LLM is asked to describe the word and determine its identity (spy or not) based on its own and other players' descriptions. Ideally, an advanced agent should be able to describe a given word accurately in an aggressive description while maximizing confusion in a conservative description, enhancing its participation in the game. To this end, we first develop DEEP to evaluate LLMs' expression and disguising abilities. DEEP requires the LLM to describe a word in aggressive and conservative modes. We then introduce SpyGame, an interactive multi-agent framework designed to assess LLMs' intelligence through participation in a competitive language-based board game. Because it incorporates multi-agent interaction, SpyGame requires the target LLM to possess both linguistic skills and strategic thinking, providing a more comprehensive evaluation of LLMs' human-like cognitive abilities and adaptability in complex communication situations. The proposed evaluation framework is very easy to implement. We collected words from multiple sources, domains, and languages and used the proposed evaluation framework to conduct experiments. Extensive experiments demonstrate that the proposed DEEP and SpyGame effectively evaluate the capabilities of various LLMs, capturing their ability to adapt to novel situations and engage in strategic communication.
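To make the game loop described in the abstract concrete, the following is a minimal sketch of one round, not the authors' implementation. It assumes the usual rules of the game (one spy, each player describes their word in turn, then everyone votes and the most-voted player is eliminated). The function names `assign_words`, `play_round`, `leaky_describe`, and `rarest_vote` are hypothetical, and the two stub agents stand in for real LLM calls.

```python
import random
from collections import Counter
from typing import Callable, Dict, List, Tuple

# Hypothetical agent interfaces (assumptions, not from the paper):
#   describe(player, secret_word, earlier_descriptions) -> description text
#   vote(player, descriptions_by_player) -> name of the suspected spy
Describe = Callable[[str, str, List[str]], str]
Vote = Callable[[str, Dict[str, str]], str]


def assign_words(players: List[str], villager_word: str, spy_word: str,
                 rng: random.Random) -> Tuple[Dict[str, str], str]:
    """Pick one spy at random; the spy gets spy_word, everyone else villager_word."""
    spy = rng.choice(players)
    words = {p: (spy_word if p == spy else villager_word) for p in players}
    return words, spy


def play_round(players: List[str], words: Dict[str, str],
               describe: Describe, vote: Vote) -> Tuple[Dict[str, str], str]:
    """Run one describe-then-vote round and return (descriptions, eliminated player)."""
    descriptions: Dict[str, str] = {}
    for p in players:
        # Each player sees only the descriptions given before their turn.
        descriptions[p] = describe(p, words[p], list(descriptions.values()))
    ballots = Counter(vote(p, descriptions) for p in players)
    # Assumption: ties go to whichever tied player received a ballot first.
    eliminated, _ = ballots.most_common(1)[0]
    return descriptions, eliminated


def leaky_describe(player: str, word: str, earlier: List[str]) -> str:
    """Stub agent: a deliberately poor disguise that simply reveals the word."""
    return f"My word is {word}."


def rarest_vote(player: str, descriptions: Dict[str, str]) -> str:
    """Stub agent: suspect whoever gave the least common description."""
    counts = Counter(descriptions.values())
    return min(descriptions, key=lambda p: counts[descriptions[p]])


if __name__ == "__main__":
    players = ["Player 1", "Player 2", "Player 3", "Player 4"]
    words, spy = assign_words(players, "BERT", "GPT", random.Random(0))
    descriptions, eliminated = play_round(players, words, leaky_describe, rarest_vote)
    print("spy:", spy, "| eliminated:", eliminated)
```

With these stubs the spy is always caught, because the spy's description gives the word away. This illustrates why an agent needs a conservative description that stays related to its word without revealing it; replacing `leaky_describe` and `rarest_vote` with LLM-backed functions would turn the sketch into an evaluation loop.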

Paper Structure (37 sections, 3 figures, 8 tables, 1 algorithm)

Introduction
DEEP: Dual Expression Evaluation Program
Methodology
Prompting
Judging
Evaluation Metrics
Experiment
Result
SpyGame: An Interactive Multi-Agent Framework
Who is Spy
Game flow
Methodology
Keyword Set
Host and Guest Agents
Agent Action
...and 22 more sections

Figures (3)

Figure 1: SpyGame, our interactive multi-agent gaming framework, provides an engaging platform to assess the linguistic intelligence and deductive reasoning skills of large language models. This illustration depicts a scene from SpyGame in which Player 3 is the spy agent with the secret word "GPT", and the remaining players are villager agents with the assigned word "BERT". As Player 3 describes the text generation capabilities of the "GPT" model, Player 2 becomes increasingly suspicious due to the noticeable discrepancy between their respective words.
Figure 2: Illustration of DEEP. 1) Top: LLM describes "Batman" in an aggressive mode. This precise description demonstrates the extent of its mastery in the relevant knowledge. 2) Bottom: In conservative mode, the ambiguous description of "Batman" showcases the LLM's ability to intentionally disguise the target word while still maintaining a connection to its concept.
Figure 3: The suspicion probability of three naming methods.
