Level Up Your Tutorials: VLMs for Game Tutorials Quality Assessment
Daniele Rege Cambrin, Gabriele Scaffidi Militone, Luca Colomba, Giovanni Malnati, Daniele Apiletti, Paolo Garza
TL;DR
This work tackles the costly process of validating game tutorials by introducing a VLM-driven framework that treats tutorial frames as test cases. By extracting frames, annotating them with developer-defined QA pairs, and comparing VLM outputs against ground-truth answers, the approach quantifies tutorial clarity. The study benchmarks multiple models (GPT-4o, InternVL, DragonFly) on two tutorial versions (P and L) using ROUGE and BERT-Score, showing improvements in the latest version and offering scalable, automated feedback to developers. The results support the viability of automatic tutorial quality assessment and point to future work aligning VLM judgments with player perception and broadening the scope to other quality faults.
Abstract
Designing effective game tutorials is crucial for a smooth learning curve for new players, especially in games with many rules and complex core mechanics. Evaluating the effectiveness of these tutorials usually requires multiple iterations with testers who have no prior knowledge of the game. Recent Vision-Language Models (VLMs) have demonstrated significant capabilities in understanding and interpreting visual content: they can analyze images, answer questions about their content, and recognize objects, actions, and contexts in visual data, which makes them valuable tools for applications such as automated game testing. In this work, we propose an automated game-testing solution to evaluate the quality of game tutorials. Our approach leverages VLMs to analyze frames from video game tutorials, answer relevant questions to simulate human perception, and provide feedback. This feedback is compared with expected results to identify confusing or problematic scenes and highlight potential errors for developers. In addition, we publish the complete tutorial videos and annotated frames from the different game versions used in our tests. This solution reduces the need for extensive manual testing, speeding up and simplifying the early stages of tutorial development and, in turn, improving the final game experience.
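The comparison step described above (VLM answers checked against developer-defined expected answers) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `FrameCase` structure, the function names, the 0.5 threshold, and the simplified unigram-overlap score standing in for ROUGE are all assumptions made for the example, and the VLM call itself is left out, with its answers passed in as a plain dictionary.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class FrameCase:
    """One tutorial frame treated as a test case (hypothetical structure)."""
    frame_id: str
    question: str
    expected: str  # developer-defined ground-truth answer


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between two strings (a simplified ROUGE-1)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped count of shared tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def flag_confusing_frames(cases, answers, threshold=0.5):
    """Return (frame_id, score) pairs whose answer scores below the threshold.

    `answers` maps frame_id to the text a VLM produced for that frame;
    the threshold is an assumed value, not one taken from the paper.
    """
    flagged = []
    for case in cases:
        score = rouge1_f1(answers.get(case.frame_id, ""), case.expected)
        if score < threshold:
            flagged.append((case.frame_id, score))
    return flagged


if __name__ == "__main__":
    cases = [
        FrameCase("f1", "Which keys move the character?", "press the arrow keys"),
        FrameCase("f2", "What does the red bar show?", "the player health"),
    ]
    # Stand-in for VLM outputs; a real run would query a model per frame.
    answers = {"f1": "press the arrow keys", "f2": "a score counter"}
    print(flag_confusing_frames(cases, answers))
```

Frames that end up in the flagged list are candidates for developer review, since a low score suggests the model could not recover the intended information from the frame. A production setup would more likely use an established ROUGE implementation and an embedding-based metric such as BERT-Score rather than this hand-rolled overlap score.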
