On the Evaluation of Procedural Level Generation Systems
Oliver Withington, Michael Cook, Laurissa Tokarchuk
TL;DR
This paper analyzes how procedural content generation (PCG) systems for game levels are evaluated, proposing a novel taxonomy built around data extraction, metrics, comparison points, and game domain. By surveying 86 recent publications, it reveals broad methodological diversity, a reliance on internal comparisons, and prevalent use of fitness and aesthetic measures, while warning that many studies use original domains, which impedes cross-study benchmarking. The authors argue for three improvements: allowing evaluation-free system descriptions when appropriate, developing shared frameworks and benchmarks, and promoting reuse of prior methodologies to improve comparability. These insights aim to align academic PCG evaluation with practical industry needs and to accelerate robust, reproducible progress in the field. The work also discusses future directions, including the impact of generative AI, meta-analysis of PCG domains, and the development of more generalizable evaluation practices.
Abstract
The evaluation of procedural content generation (PCG) systems for generating video game levels is a complex and contested topic. Ideally, the field would have access to robust, generalisable and widely accepted evaluation approaches that can be used to compare novel PCG systems to prior work, but consensus on how to evaluate novel systems is currently limited. We argue that the field can benefit from a structured analysis of how procedural level generation systems can be evaluated, and how these techniques are currently used by researchers. Such an analysis can both describe the current state of practice and provide data to justify changes to it. This work aims to provide that analysis by first developing a novel taxonomy of PCG evaluation approaches, and then presenting the results of a survey of recent work in the field through the lens of this taxonomy. The results of this survey highlight several important weaknesses in current practice, which we argue could be substantially mitigated by 1) promoting the use of evaluation-free system descriptions where appropriate, 2) promoting the development of diverse research frameworks, and 3) promoting the reuse of code and methodology wherever possible.
