Boardwalk: Towards a Framework for Creating Board Games with LLMs

Álvaro Guglielmin Becker, Gabriel Bauer de Oliveira, Lana Bertoldo Rossato, Anderson Rocha Tavares

TL;DR

This work investigates whether Large Language Models can generate digital versions of board games from natural-language rules using Boardwalk, a Python-based General Game Playing API. It conducts a broad evaluation across three LLMs and 12 anonymized games in three code-generation modes, totaling 108 experiments, to assess playability and rule compliance. Claude 3.7 Sonnet emerges as the best performer with 55.6% of games produced without errors, though API usage tends to increase error frequency. The study introduces Boardwalk as a practical framework for rapid board game code generation and outlines future directions toward an AI-assisted board game creation framework.

Abstract

Implementing board games in code can be a time-consuming task. However, Large Language Models (LLMs) have proven effective at generating code for domain-specific tasks with simple contextual information. We aim to investigate whether LLMs can implement digital versions of board games from rules described in natural language. This would be a step towards an LLM-assisted framework for quick board game code generation. We expect to determine the main challenges LLMs face in implementing board games, and how different approaches and models compare to one another. We task three state-of-the-art LLMs (Claude, DeepSeek and ChatGPT) with coding a selection of 12 popular and obscure games, both in free-form and within Boardwalk, our proposed General Game Playing API. We anonymize the games and components to avoid evoking pre-trained LLM knowledge. The implementations are tested for playability and rule compliance. We evaluate success rate and common errors across LLMs and game popularity. Our approach proves viable, with the best-performing model, Claude 3.7 Sonnet, yielding 55.6% of games without any errors. While compliance with the API increases error frequency, the severity of errors depends more strongly on the LLM. We outline future steps for creating a framework to integrate this process, making the creation of board games more accessible.
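
As a rough check on the experiment count stated above, the grid of 3 LLMs, 12 games, and 3 code-generation modes can be enumerated directly. This is only an illustrative sketch: the game and mode labels are placeholders (the text here does not list them), and reading the 55.6% figure as a share of one model's runs is an assumption, not something the abstract states.

```python
from itertools import product

# Placeholder labels: the abstract names the three model families but does not
# list the games or name all three code-generation modes.
llms = ["Claude", "DeepSeek", "ChatGPT"]
games = [f"game_{i:02d}" for i in range(1, 13)]  # 12 anonymized games
modes = [f"mode_{i}" for i in range(1, 4)]       # 3 code-generation modes

# One experiment per (LLM, game, mode) combination.
experiments = list(product(llms, games, modes))
total = len(experiments)      # 3 * 12 * 3
per_llm = total // len(llms)  # experiments run with each model

# Assumption: the reported 55.6% is measured over a single model's runs.
error_free = round(0.556 * per_llm)

print(total, per_llm, error_free)
```

Under that assumption, the reported rate would correspond to 20 error-free implementations out of the 36 produced by the best-performing model.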

Paper Structure

This paper contains 8 sections, 1 figure, and 2 tables.

Figures (1)

  • Figure 1: Occurrences of each error type per game, pooled from all experiments.