GameEval: Evaluating LLMs on Conversational Games

Dan Qiao; Chenfei Wu; Yaobo Liang; Juntao Li; Nan Duan

TL;DR

GameEval introduces a bias-resistant, ground-truth-free framework for evaluating LLMs by engaging them in goal-driven conversational games. The approach treats models as players with roles and long-term objectives, using diverse dialogue forms to measure integrated capabilities rather than single-task performance. It contributes three games (Ask-Guess, SpyFall, TofuKingdom) with tailored evaluation metrics and demonstrates clear discrimination among ChatGPT, GPT-4, and Text-Davinci-003. The work highlights the potential of game-based evaluation for assessing complex, real-world problem-solving in LLMs and provides public code to enable broader adoption and extension.

Abstract

The rapid advancements in large language models (LLMs) have presented challenges in evaluating those models. Existing evaluation methods are either reference-based or preference-based, which inevitably need human intervention or introduce test bias caused by evaluator models. In this paper, we propose GameEval, a novel approach to evaluating LLMs through goal-driven conversational games, overcoming the limitations of previous methods. GameEval treats LLMs as game players and assigns them distinct roles with specific goals achieved by launching conversations of various forms, including discussion, question answering, and voting. We design three unique games with cooperative or adversarial objectives, accompanied by corresponding evaluation metrics, to show how this new paradigm comprehensively evaluates model performance. Through extensive experiments, we show that GameEval can effectively differentiate the capabilities of various LLMs, providing a comprehensive assessment of their integrated abilities to solve complex problems. Our public anonymous code is available at https://github.com/GameEval/GameEval.
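To make the "LLMs as game players" idea concrete, the following is a minimal sketch of a two-player guessing episode loosely modeled on Ask-Guess. It is not taken from the GameEval repository: the `Player` type, `play_ask_guess`, `GameResult`, the win condition (the questioner names the secret word), and the round limit are all assumptions made for illustration, and the scripted players stand in for real LLM calls.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical player interface: maps the dialogue history to the next utterance.
# In a real setup this would wrap an LLM call; here it is a plain function.
Player = Callable[[List[str]], str]


@dataclass
class GameResult:
    success: bool                      # did the questioner name the secret word?
    rounds: int                        # rounds used (max_rounds on failure)
    history: List[str] = field(default_factory=list)


def play_ask_guess(questioner: Player, answerer: Player, secret: str,
                   max_rounds: int = 10) -> GameResult:
    """Run one episode under simplified, assumed rules (not the paper's exact ones)."""
    history: List[str] = []
    for round_no in range(1, max_rounds + 1):
        question = questioner(history)
        history.append(f"Q: {question}")
        # Assumed win condition: the secret word appears in the question.
        if secret.lower() in question.lower():
            return GameResult(True, round_no, history)
        answer = answerer(history)
        history.append(f"A: {answer}")
    return GameResult(False, max_rounds, history)


def scripted_questioner(history: List[str]) -> str:
    # Stand-in for an LLM: walks through a fixed list of questions.
    script = ["Is it a fruit?", "Is it red or green?", "Is it an apple?"]
    turn = len(history) // 2           # each completed round adds two entries
    return script[min(turn, len(script) - 1)]


def scripted_answerer(history: List[str]) -> str:
    # Stand-in for an LLM that knows the secret word.
    return "Yes."


result = play_ask_guess(scripted_questioner, scripted_answerer, "apple")
print(result.success, result.rounds)
```

Aggregating `success` and `rounds` over many secret words would give a success rate and an average round count per model, which is one plausible shape for a game-level metric; the metrics actually used in the paper are defined in its per-game evaluation sections.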

Paper Structure (35 sections, 6 figures, 4 tables)

Introduction
Related Work
Reference-based Evaluation
Preference-based Evaluation
GameEval
Ask-Guess
Game Introduction
Evaluation
SpyFall
Game Introduction
Evaluation Metrics
TofuKingdom
Game Introduction
Evaluation Metrics
Experiments
...and 20 more sections

Figures (6)

Figure 1: Comparison between our proposed GameEval and the widely used benchmarks. The slope represents the ratio between the performance score of ChatGPT and GPT-4. By playing goal-driven conversational games, GameEval provides more distinguishable results in ChatGPT vs GPT-4.
Figure 2: (a) An example of the game Ask-Guess, where the given word is "apple." (b) An example of the game SpyFall, where the common word is "iphone," and the spy word is "ipad." (c) An example of the game TofuKingdom.
Figure 3: A case to show the distinction in capabilities of ChatGPT and GPT-4 in Ask-Guess. The word to guess is mushroom.
Figure 4: We demonstrate the different input formats for different types of LLMs in the game Ask-Guess. (a) Pure text prompt for the common generative LLM. (b) Role-based messages for multi-turn chat models like ChatGPT.
Figure 5: Illustration of private history and the model's output with CoT in SpyFall.
...and 1 more figure
