LLMs May Not Be Human-Level Players, But They Can Be Testers: Measuring Game Difficulty with LLM Agents

Chang Xiao, Brenda Z. Yang

TL;DR

This work introduces a general LLM-based framework for measuring game difficulty using minimally tuned agents that interact with text representations of games. It validates the approach on Wordle and Slay the Spire, showing that while LLMs typically underperform humans in gameplay, their difficulty signals correlate significantly with human judgments, especially with GPT-4 and Chain-of-Thought prompting. The study demonstrates that model choice and prompting strategy substantially affect alignment with human difficulty, and it provides actionable guidelines for integrating LLM testers into game-design workflows. The results suggest LLMs can serve as scalable, general-purpose testers to shape relative difficulty curves and inform iterative game design.

Abstract

Recent advances in Large Language Models (LLMs) have demonstrated their potential as autonomous agents across various tasks. One emerging application is the use of LLMs in playing games. In this work, we explore a practical problem for the gaming industry: Can LLMs be used to measure game difficulty? We propose a general game-testing framework using LLM agents and test it on two widely played strategy games: Wordle and Slay the Spire. Our results reveal an interesting finding: although LLMs may not perform as well as the average human player, their performance, when guided by simple, generic prompting techniques, shows a statistically significant and strong correlation with difficulty indicated by human players. This suggests that LLMs could serve as effective agents for measuring game difficulty during the development process. Based on our experiments, we also outline general principles and guidelines for incorporating LLMs into the game testing process.

Paper Structure

This paper contains 27 sections, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Overview of the proposed framework for LLM-based game difficulty testing. In each step of the game loop, game information is extracted via APIs, converted into natural language, and processed by the LLM along with additional details, such as game rules and strategies, using prompting techniques like Chain-of-Thought. The LLM outputs a suggested player action, which is translated into API calls or keyboard/mouse events to execute in-game. The loop continues until the challenge is completed or failed.
  • Figure 2: An example run of the LLM agent solving a Wordle puzzle. The LLM is initially provided with the game rules and type-specific prompt (e.g., Zero-Shot, CoT and CoT+). The LLM then generates its first guess and receives feedback from the game regarding correct and incorrect letters and their positions. Using this feedback, along with the same prompt from the previous turn, the LLM produces its next guess. This cycle continues until the LLM either guesses the correct word or exceeds the maximum number of allowed guesses.
  • Figure 3: An example of a turn in Slay the Spire played by LLMs. The top figure shows a screenshot of the game, along with information that can be perceived by a player. Below is an example interaction between the game and the LLM.
  • Figure 4: An example response from GPT-4 when prompted with CoT to play StS.
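
The loop described in the captions for Figures 1 and 2 can be sketched for Wordle as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `wordle_feedback`, `describe`, `play_wordle`, and `scripted_agent` are hypothetical, the wording of the feedback text is invented for the example, and the `agent` callable is a stand-in for whatever LLM call (Zero-Shot, CoT, or CoT+) a real setup would use. The sketch also does not validate guess length or vocabulary.

```python
from collections import Counter
from typing import Callable, Iterable, List, Tuple


def wordle_feedback(guess: str, target: str) -> str:
    """Per-letter feedback: G = correct position, Y = present elsewhere, - = absent."""
    feedback = ["-"] * len(guess)
    remaining = Counter()  # target letters not matched exactly
    for i, (g, t) in enumerate(zip(guess, target)):
        if g == t:
            feedback[i] = "G"
        else:
            remaining[t] += 1
    for i, g in enumerate(guess):
        if feedback[i] == "-" and remaining[g] > 0:
            feedback[i] = "Y"
            remaining[g] -= 1
    return "".join(feedback)


def describe(guess: str, feedback: str) -> str:
    """Convert raw feedback into natural language for the prompt (hypothetical wording)."""
    words = {"G": "correct position", "Y": "wrong position", "-": "not in word"}
    parts = [f"{letter}: {words[mark]}" for letter, mark in zip(guess, feedback)]
    return f"Guess '{guess}' -> " + "; ".join(parts)


def play_wordle(
    target: str,
    agent: Callable[[str], str],  # assumption: maps a prompt string to a guess
    base_prompt: str,             # game rules plus the type-specific prompt
    max_guesses: int = 6,
) -> Tuple[bool, int]:
    """Run the game loop until the word is guessed or the guesses run out."""
    history: List[str] = []
    for turn in range(1, max_guesses + 1):
        # The same base prompt is reused each turn, extended with feedback so far.
        prompt = "\n".join([base_prompt, *history])
        guess = agent(prompt).strip().lower()
        if guess == target:
            return True, turn
        history.append(describe(guess, wordle_feedback(guess, target)))
    return False, max_guesses


def scripted_agent(guesses: Iterable[str]) -> Callable[[str], str]:
    """Placeholder agent that replays fixed guesses; a real tester would query an LLM."""
    it = iter(guesses)
    return lambda prompt: next(it)


if __name__ == "__main__":
    solved, turns = play_wordle(
        "crate", scripted_agent(["slate", "crane", "crate"]), "Rules: guess the 5-letter word."
    )
    print(solved, turns)
```

Running many target words through `play_wordle` and recording the returned success flag and turn count gives one possible per-puzzle difficulty signal that could then be compared with human data.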