From Pixels to Titles: Video Game Identification by Screenshots using Convolutional Neural Networks
Fabricio Breve
TL;DR
This work tackles automatic identification of video game titles from single screenshots across 22 home console systems. It evaluates 10 CNN and 3 transformer architectures, using ImageNet pre-trained weights and, alternatively, weights fine-tuned on arcade screenshots, on a large Moby Games-based dataset (8,796 games, 170,881 screenshots). All CNNs except VGG16 outperform the transformers. With ImageNet weights, EfficientNetV2S achieves the highest average accuracy (77.44%) and the best per-system result on 16 of the 22 systems, with Atari 2600 showing high stability. Arcade-derived initial weights improve accuracy for the EfficientNet architectures and reduce training time, and selecting the best architecture and weights per system raises overall accuracy to 78.79%. The study highlights the efficacy of CNNs for screenshot-based game-title identification, while also outlining practical limitations and avenues for future work, such as system detection and ensemble methods to scale to larger title sets.
Abstract
This paper investigates video game identification from single screenshots, using ten convolutional neural network (CNN) architectures (VGG16, ResNet50, ResNet152, MobileNet, DenseNet169, DenseNet201, EfficientNetB0, EfficientNetB2, EfficientNetB3, and EfficientNetV2S) and three transformer architectures (ViT-B16, ViT-L32, and SwinT) across 22 home console systems, spanning from Atari 2600 to PlayStation 5 and totalling 8,796 games and 170,881 screenshots. Except for VGG16, all CNNs outperformed the transformers in this task. With ImageNet pre-trained weights as initial weights, EfficientNetV2S achieves the highest average accuracy (77.44%) and the highest accuracy in 16 of the 22 systems. DenseNet201 is the best in four systems and EfficientNetB3 is the best in the remaining two. Using alternative initial weights fine-tuned on an arcade screenshot dataset boosts accuracy for the EfficientNet architectures, with EfficientNetV2S reaching a peak accuracy of 77.63% and its average number of epochs to convergence falling from 26.9 to 24.5. Overall, selecting the best combination of architecture and initial weights for each system attains 78.79% accuracy, with EfficientNetV2S leading in 15 systems. These findings underscore the efficacy of CNNs in video game identification through screenshots.
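As an illustration of how a per-system "best combination" figure can be derived, the following minimal sketch picks the highest-scoring (architecture, initial weights) pair for each system and averages those best scores. The system names and accuracy values are hypothetical placeholders, not results from this work, and taking an unweighted mean across systems is an assumption about how the overall figure is aggregated.

```python
from statistics import mean

# Hypothetical accuracies (NOT results from the paper), keyed by
# system -> (architecture, initial weights) -> test accuracy.
results = {
    "System A": {
        ("EfficientNetV2S", "imagenet"): 0.80,
        ("EfficientNetV2S", "arcade"): 0.82,
        ("DenseNet201", "imagenet"): 0.79,
    },
    "System B": {
        ("EfficientNetV2S", "imagenet"): 0.70,
        ("EfficientNetV2S", "arcade"): 0.69,
        ("DenseNet201", "imagenet"): 0.74,
    },
}


def best_per_system(results):
    """Return, for each system, the (configuration, accuracy) pair with the highest accuracy."""
    return {
        system: max(scores.items(), key=lambda item: item[1])
        for system, scores in results.items()
    }


def mean_best_accuracy(results):
    """Unweighted mean of the best per-system accuracies (an assumed aggregation)."""
    return mean(acc for _, acc in best_per_system(results).values())


def mean_single_config(results, config):
    """Unweighted mean accuracy of one fixed configuration across all systems."""
    return mean(scores[config] for scores in results.values())


if __name__ == "__main__":
    for system, (config, acc) in best_per_system(results).items():
        print(system, config, acc)
    print("mean of per-system best:", mean_best_accuracy(results))
```

By construction, the mean of per-system best scores can never fall below the mean of any single fixed configuration, which is consistent with the combined figure exceeding the best single-architecture average.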
