Not All the Same: Understanding and Informing Similarity Estimation in Tile-Based Video Games

Sebastian Berns, Vanessa Volz, Laurissa Tokarchuk, Sam Snodgrass, Christian Guckelsberger

TL;DR

This paper investigates how well automated similarity metrics align with human perception of visual similarity in tile-based game levels. It combines a large-scale human study with a diverse set of metrics (CV embeddings, PCG metrics, and general distances) to build perceptual embeddings via t-STE and evaluate metric performance across two popular titles and two representations. DreamSim emerges as the best overall predictor of human similarity (with CLIP close behind), while simpler PCG metrics like Tile Frequencies remain viable for low-resource contexts; larger, more complex pattern representations perform poorly. A follow-up qualitative study interprets the embedding dimensions, revealing dimensions such as pattern complexity, symmetry, and tile colours as central perceptual cues, and underscoring the role of sprite choices. Overall, the work provides practical guidance for metric selection in game development and lays a foundation for future data-driven, domain-specific similarity metrics, including potential ensemble approaches and broader stimulus sets.

Abstract

Similarity estimation is essential for many game AI applications, from the procedural generation of distinct assets to automated exploration with game-playing agents. While similarity metrics often substitute for human evaluation, their alignment with our judgement is unclear. Consequently, the result of their application can fail human expectations, leading, for example, to unappreciated content or unbelievable agent behaviour. We address this gap through a multi-factorial study of two tile-based games in two representations, where participants (N=456) judged the similarity of level triplets. Based on this data, we construct domain-specific perceptual spaces, encoding similarity-relevant attributes. We compare 12 metrics to these spaces and evaluate their approximation quality through several quantitative lenses. Moreover, we conduct a qualitative labelling study to identify the features underlying human similarity judgements in this popular genre. Our findings inform the selection of existing metrics and highlight requirements for the design of new similarity metrics benefiting game development and research.

Paper Structure

This paper contains 29 sections, 11 figures, and 3 tables.

Figures (11)

  • Figure 1: Mean squared errors (lower is better; horizontal axes) when comparing the pairwise similarity matrices of different candidate metrics (vertical axis) to those derived from the perceptual embeddings of the four experimental conditions (subplots).
  • Figure 2: Cohen’s kappa (higher is better): inter-rater agreement between human participants and computational metrics over all experimental conditions (subplots). Summaries here show box plots with median values and interquartile ranges. Full raincloud plots can be found in the appendix.
  • Figure 3: Elbow plots for t-STE goodness of fit in all conditions (from top left to bottom right: ccs-img, ccs-pat, loz-img, loz-pat). We choose 4 as the number of dimensions (horizontal axis) for perceptual embeddings based on the evaluation of overall normalised errors (vertical axis).
  • Figure 4: Cohen’s kappa (higher is better): inter-rater agreement between human participants and computational metrics over all experimental conditions (subplots). Each data point indicates Cohen’s kappa comparing the similarity judgements of a single participant against those of a given metric on the same subset of triplets. Each raincloud plot features individual data points as dots, the estimated kernel density over the data as a curve above the data points, and a box plot with the sample minimum, maximum and median, as well as the first and third quartiles and outliers.
  • Figure 5: Unachieved agreement (lower is better): difference between the maximum value $\kappa_\text{max}$ and Cohen’s kappa for the inter-rater agreement between human participants and computational metrics over all experimental conditions (subplots). Each data point indicates $\kappa_\text{max}$ minus Cohen’s kappa, when comparing the similarity judgements of a single participant against those of a given metric on the same subset of triplets. Each raincloud plot features individual data points as dots, the estimated kernel density over the data as a curve above the data points, and a box plot with the sample minimum, maximum and median, as well as the first and third quartiles and outliers.
  • ...and 6 more figures
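
The agreement measures named in the captions for Figures 2, 4 and 5 can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes each triplet judgement is encoded as a categorical choice (here `0` or `1`), and it assumes $\kappa_\text{max}$ is the largest kappa attainable given both raters' marginal choice frequencies, which is a common definition but is not confirmed by the text above. The function names and the example data are hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Return (kappa, kappa_max) for two equal-length sequences of choices.

    kappa_max follows the marginal-based definition (an assumption here):
    the observed agreement is replaced by the best agreement the two
    raters' label frequencies would allow.
    """
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters need the same, non-zero number of judgements")

    n = len(rater_a)
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = set(freq_a) | set(freq_b)

    # Observed agreement: share of triplets where both made the same choice.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: expected overlap if choices were independent.
    p_e = sum((freq_a[l] / n) * (freq_b[l] / n) for l in labels)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when both raters use a single label")

    # Best possible agreement given the marginals of both raters.
    p_max = sum(min(freq_a[l], freq_b[l]) for l in labels) / n

    kappa = (p_o - p_e) / (1.0 - p_e)
    kappa_max = (p_max - p_e) / (1.0 - p_e)
    return kappa, kappa_max


# Hypothetical data: one participant and one metric judging the same 8 triplets.
participant = [0, 1, 1, 0, 1, 0, 1, 1]
metric = [0, 1, 0, 0, 1, 1, 1, 1]

kappa, kappa_max = cohens_kappa(participant, metric)
unachieved = kappa_max - kappa  # the quantity plotted in Figure 5, under the assumptions above
print(f"kappa={kappa:.3f}, kappa_max={kappa_max:.3f}, unachieved={unachieved:.3f}")
```

Computing one such value per participant and metric pair would yield the individual data points that the raincloud plots describe.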