Not All the Same: Understanding and Informing Similarity Estimation in Tile-Based Video Games
Sebastian Berns, Vanessa Volz, Laurissa Tokarchuk, Sam Snodgrass, Christian Guckelsberger
TL;DR
This paper investigates how well automated similarity metrics align with human perception of visual similarity in tile-based game levels. It combines a large-scale human study with a diverse set of metrics (computer vision embeddings, PCG metrics, and general-purpose distances), builds perceptual embeddings from the human judgements via t-STE, and evaluates metric performance across two popular titles and two representations. DreamSim emerges as the best overall predictor of human similarity, with CLIP close behind. Simpler PCG metrics such as Tile Frequencies remain viable for low-resource contexts, whereas larger, more complex pattern representations perform poorly. A follow-up qualitative study interprets the embedding dimensions, identifying pattern complexity, symmetry, and tile colours as central perceptual cues and underscoring the role of sprite choices. Overall, the work provides practical guidance for metric selection in game development and lays a foundation for future data-driven, domain-specific similarity metrics, including potential ensemble approaches and broader stimulus sets.
Abstract
Similarity estimation is essential for many game AI applications, from the procedural generation of distinct assets to automated exploration with game-playing agents. While similarity metrics often substitute for human evaluation, their alignment with human judgement is unclear. Consequently, the results of their application can fail to meet human expectations, leading, for example, to unappreciated content or unbelievable agent behaviour. We address this gap through a multi-factorial study of two tile-based games in two representations, where participants (N=456) judged the similarity of level triplets. Based on this data, we construct domain-specific perceptual spaces that encode similarity-relevant attributes. We compare 12 metrics to these spaces and evaluate their approximation quality through several quantitative lenses. Moreover, we conduct a qualitative labelling study to identify the features underlying human similarity judgements in this popular genre. Our findings inform the selection of existing metrics and highlight requirements for the design of new similarity metrics that benefit game development and research.
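To make the idea of comparing a metric against triplet judgements concrete, the following is a minimal sketch and not the authors' evaluation pipeline. It assumes that each human judgement is stored as a tuple (anchor, closer, farther), that levels are lists of strings with one character per tile, and that a tile-frequency metric can be read as the Euclidean distance between normalised tile counts. All function names, the data layout, and the toy levels are hypothetical.

```python
from collections import Counter
import math


def tile_frequencies(level):
    """Return normalised tile counts for a level given as a list of strings."""
    counts = Counter(tile for row in level for tile in row)
    total = sum(counts.values())
    return {tile: n / total for tile, n in counts.items()}


def tile_frequency_distance(level_a, level_b):
    """Euclidean distance between the tile-frequency vectors of two levels.

    Assumption: this is one plausible reading of a tile-frequency metric,
    not necessarily the definition used in the paper.
    """
    freq_a = tile_frequencies(level_a)
    freq_b = tile_frequencies(level_b)
    tiles = set(freq_a) | set(freq_b)
    return math.sqrt(
        sum((freq_a.get(t, 0.0) - freq_b.get(t, 0.0)) ** 2 for t in tiles)
    )


def triplet_agreement(triplets, levels, distance):
    """Fraction of triplets where the metric ranks 'closer' nearer than 'farther'.

    Assumption: each triplet is (anchor, closer, farther), where humans judged
    'closer' to be more similar to 'anchor' than 'farther' is.
    """
    if not triplets:
        raise ValueError("at least one triplet is required")
    agree = 0
    for anchor, closer, farther in triplets:
        d_close = distance(levels[anchor], levels[closer])
        d_far = distance(levels[anchor], levels[farther])
        if d_close < d_far:
            agree += 1
    return agree / len(triplets)


# Toy data (hypothetical): '#' is a solid tile, '.' is empty space.
levels = {
    "a": ["##..", "#..."],
    "b": ["##..", "..#."],
    "c": ["....", "...."],
}
# Two made-up judgements; the second contradicts the metric on purpose.
triplets = [("a", "b", "c"), ("a", "c", "b")]

score = triplet_agreement(triplets, levels, tile_frequency_distance)
print(f"triplet agreement: {score:.2f}")
```

Levels "a" and "b" have identical tile frequencies despite different layouts. This shows why such a metric is cheap but blind to spatial attributes such as symmetry or pattern complexity.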
