Codenames as a Benchmark for Large Language Models

Matthew Stephenson, Matthew Sidji, Benoît Ronval

TL;DR

The paper proposes Codenames as a benchmark for evaluating Large Language Models on language understanding, theory of mind, and epistemic reasoning. It introduces an updated Codenames AI framework that replicates the full game rules and supports both single-team and two-team play, then benchmarks multiple state-of-the-art LLMs against traditional word-vector baselines. Key findings show that LLMs display distinct emergent playstyles and generalise better than those baselines when paired with diverse teammates. Performance also differs between the single-team setting, which favours cautious strategies, and the two-team setting, which favours riskier ones; notably, o1-preview often excels in cooperative play. The work demonstrates Codenames’ potential as a multifaceted benchmark for assessing language-based reasoning, strategic decision-making, and cooperative AI, with implications for designing robust, adaptable agents and for understanding emergent behaviours in LLM-driven games.

Abstract

In this paper, we propose the use of the popular word-based board game Codenames as a suitable benchmark for evaluating the reasoning capabilities of Large Language Models (LLMs). Codenames presents a highly interesting challenge for achieving successful AI performance, requiring a sophisticated understanding of language, theory of mind, and epistemic reasoning capabilities. Prior attempts to develop agents for Codenames have largely relied on word embedding techniques, which have a limited vocabulary range and perform poorly when paired with differing approaches. LLMs have demonstrated enhanced reasoning and comprehension capabilities for language-based tasks, but can still struggle with lateral thinking challenges. We evaluate the capabilities of several state-of-the-art LLMs, including GPT-4o, Gemini 1.5, Claude 3.5 Sonnet, and Llama 3.1, across a variety of board setups. Our results indicate that while certain LLMs perform better than others overall, different models exhibit varying emergent behaviours during gameplay and excel at specific roles. We also evaluate the performance of different combinations of LLMs when playing cooperatively together, demonstrating that LLM agents are more generalisable to a wider range of teammates than prior techniques.

Paper Structure

This paper contains 39 sections, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Codenames example board setup (seed = 0). Words associated with each team are shown in red or blue, civilian words are shown in grey, and the assassin word is shown in purple.
  • Figure 2: Average clue number provided by each codemaster model as the turn number increases (single team version).
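
The seeded board setup described in the Figure 1 caption can be illustrated with a minimal sketch. It is not the paper's framework code. The role counts (9, 8, 7, 1) follow the standard Codenames rules and are an assumption about the framework, and the `make_board` function and the placeholder word list are hypothetical names introduced here for illustration.

```python
import random
from collections import Counter

# Standard Codenames card counts (assumption: the paper's framework may differ).
ROLE_COUNTS = {"red": 9, "blue": 8, "civilian": 7, "assassin": 1}


def make_board(words, seed=0):
    """Pick 25 words and assign each a hidden role, reproducibly for a given seed."""
    needed = sum(ROLE_COUNTS.values())
    if len(words) < needed:
        raise ValueError(f"need at least {needed} words, got {len(words)}")
    rng = random.Random(seed)  # local generator, so the global RNG state is untouched
    chosen = rng.sample(words, needed)
    roles = [role for role, count in ROLE_COUNTS.items() for _ in range(count)]
    rng.shuffle(roles)
    return dict(zip(chosen, roles))


# Hypothetical placeholder vocabulary; a real game would use the Codenames word list.
WORDS = [f"word{i}" for i in range(40)]

board = make_board(WORDS, seed=0)
print(Counter(board.values()))
```

Fixing the seed makes a board reproducible, so different codemaster and guesser models can be compared on identical setups.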