The Procedural Content Generation Benchmark: An Open-source Testbed for Generative Challenges in Games

Ahmed Khalifa, Roberto Gallotta, Matthew Barthet, Antonios Liapis, Julian Togelius, Georgios N. Yannakakis

TL;DR

The paper presents the Procedural Content Generation Benchmark, an open-source testbed for standardizing the evaluation of generative algorithms across diverse game content tasks. It formalizes three evaluation axes (quality, diversity, and controllability) and implements a modular, OpenAI Gym-like framework with 12 distinct PCG problems, each with its own content representation. Three baseline search-based generators (Random, μ+λ Evolution Strategy, and Genetic Algorithm) are evaluated across all problems to illustrate task difficulty and how the choice of fitness target affects the feasibility, controllability, and diversity of the generated artifacts. The benchmark is intended to enable reproducible comparisons, support education and experimentation, and serve as a foundation for future extensions, including more complex problems and integration with modern AI generators.

Abstract

This paper introduces the Procedural Content Generation Benchmark for evaluating generative algorithms on different game content creation tasks. The benchmark comes with 12 game-related problems, each with multiple variants. Problems range from creating levels of different kinds to creating rule sets for simple arcade games. Each problem has its own content representation, control parameters, and evaluation metrics for quality, diversity, and controllability. This benchmark is intended as a first step towards a standardized way of comparing generative algorithms. We use the benchmark to score three baseline algorithms: a random generator, an evolution strategy, and a genetic algorithm. Results show that some problems are easier to solve than others, and that the chosen objective affects the quality, diversity, and controllability of the generated artifacts.
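
To make the three evaluation axes concrete, here is a minimal sketch of what a single problem could look like. It is not the benchmark's actual API: the class `ToyBinaryProblem`, its method names, and its scoring rules are all hypothetical, chosen only to show one content representation (a bit string), one control parameter (a target number of ones), and one function per axis, each returning a value in [0, 1].

```python
import random


class ToyBinaryProblem:
    """Hypothetical toy problem (not from the benchmark).

    Content is a list of bits; the control parameter is a target count of ones.
    """

    def __init__(self, length=16, min_ones=4, diversity_threshold=0.25):
        self.length = length
        self.min_ones = min_ones
        self.diversity_threshold = diversity_threshold

    def sample_content(self, rng):
        # Random content: each bit is drawn uniformly.
        return [rng.randint(0, 1) for _ in range(self.length)]

    def sample_control(self, rng):
        # Random control parameter: a target number of ones.
        return rng.randint(self.min_ones, self.length)

    def quality(self, content):
        # Assumed rule: content is fully feasible (1.0) with at least min_ones ones.
        return min(1.0, sum(content) / self.min_ones)

    def controllability(self, content, control):
        # 1.0 when the number of ones matches the target, lower the further away.
        return 1.0 - abs(sum(content) - control) / self.length

    def diversity(self, a, b):
        # Normalized Hamming distance, saturating at the diversity threshold.
        hamming = sum(x != y for x, y in zip(a, b)) / self.length
        return min(1.0, hamming / self.diversity_threshold)

    def evaluate(self, contents, controls):
        # Per-item quality and controllability.
        q = [self.quality(c) for c in contents]
        t = [self.controllability(c, p) for c, p in zip(contents, controls)]
        # Per-item diversity: the smallest pairwise diversity to any other item.
        d = []
        for i, c in enumerate(contents):
            others = [self.diversity(c, o) for j, o in enumerate(contents) if j != i]
            d.append(min(others) if others else 1.0)
        return q, d, t


if __name__ == "__main__":
    rng = random.Random(0)
    problem = ToyBinaryProblem()
    contents = [problem.sample_content(rng) for _ in range(5)]
    controls = [problem.sample_control(rng) for _ in range(5)]
    print(problem.evaluate(contents, controls))
```

Keeping the three scores separate, instead of folding them into one number, lets a generator be strong on one axis and weak on another, which is the kind of trade-off the abstract reports.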

Paper Structure

This paper contains 14 sections, 4 figures, and 2 tables.

Figures (4)

  • Figure 1: The system diagram for the PCG Benchmark, showing how the framework is used. First, the Generator can sample an array of random content ($Array(c_r)$) and an array of random control parameters ($Array(p_r)$). Then the Generator returns an array of paired content and control parameters ($Array(c_g, p_g)$) to be evaluated. The system sends them to the current problem to calculate their values ($q(c_i)$ is the quality value of a piece of content, $d(c_i,c_j)$ is the diversity value between two pieces of content, and $t(c_i,p_i)$ is the controllability value between a piece of content and a control parameter). Finally, the system returns the results for quality ($R_q$), diversity ($R_d$), and controllability ($R_t$).
  • Figure 2: Progression of the maximum fitness when optimizing the Quality fitness with the three baseline algorithms. Results are averaged from 10 runs, with 95% confidence intervals as the shaded area.
  • Figure 3: Number of solutions $c_i$ (out of 100) in the final population (after 200 generations) that are feasible ($q(c_i) = 1$), controlled ($t(c_i, p_i)=1$), and unique ($d(c_i, \mathbb{C})=1$ compared to the final population $\mathbb{C}$). Results are averaged from 10 runs, with 95% confidence intervals as error bars.
  • Figure 4: The number of feasible and unique solutions over 100 separate runs on Binary, Sokoban, and Zelda using six different methods (three search-based generators, one constructive generator, and two few-shot LLM generators).
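
The quantities in Figures 2 and 3 can be illustrated with a small sketch of a μ+λ evolution strategy on a toy bit-string problem. Everything here is an assumption made for illustration, not the paper's code: the problem, the constants `LENGTH` and `MIN_ONES`, the staged fitness (quality first, controllability only once content is feasible), the mutation operator, and the uniqueness test. In particular, uniqueness is approximated as "no exact duplicate in the final population", standing in for $d(c_i, \mathbb{C}) = 1$.

```python
import random
from collections import Counter

LENGTH = 16    # hypothetical content size (bits)
MIN_ONES = 4   # hypothetical feasibility requirement


def quality(content):
    # q(c): 1.0 once the content has at least MIN_ONES ones.
    return min(1.0, sum(content) / MIN_ONES)


def controllability(content, target_ones):
    # t(c, p): 1.0 when the number of ones equals the control parameter.
    return 1.0 - abs(sum(content) - target_ones) / LENGTH


def fitness(individual):
    # Assumed staged fitness: reward controllability only for feasible content.
    content, control = individual
    q = quality(content)
    return q if q < 1.0 else 1.0 + controllability(content, control)


def mutate(individual, rng, rate=1.0 / LENGTH):
    # Flip each bit independently; the control parameter stays paired with the content.
    content, control = individual
    return ([1 - b if rng.random() < rate else b for b in content], control)


def mu_plus_lambda(rng, mu=10, lam=10, generations=50):
    population = [
        ([rng.randint(0, 1) for _ in range(LENGTH)], rng.randint(MIN_ONES, LENGTH))
        for _ in range(mu)
    ]
    best_per_generation = []
    for _ in range(generations):
        offspring = [mutate(rng.choice(population), rng) for _ in range(lam)]
        # Parents and offspring compete together; the best mu survive.
        population = sorted(population + offspring, key=fitness, reverse=True)[:mu]
        best_per_generation.append(fitness(population[0]))
    return population, best_per_generation


def count_final(population):
    # Counts analogous to Figure 3: feasible, controlled, and unique solutions.
    feasible = sum(quality(c) == 1.0 for c, _ in population)
    controlled = sum(controllability(c, p) == 1.0 for c, p in population)
    occurrences = Counter(tuple(c) for c, _ in population)
    unique = sum(occurrences[tuple(c)] == 1 for c, _ in population)
    return feasible, controlled, unique


if __name__ == "__main__":
    final_population, best_curve = mu_plus_lambda(random.Random(0))
    print(best_curve[-1], count_final(final_population))
```

Because parents survive alongside offspring, the best fitness per generation never decreases, which is the kind of curve Figure 2 plots. Averaging over several seeds and adding confidence intervals, as the figures do, would be a straightforward extension of this sketch.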