VideoGameBunny: Towards vision assistants for video games

Mohammad Reza Taesiri, Cor-Paul Bezemer

TL;DR

This work targets the gap in open-source vision-language models’ ability to understand video game content. It introduces VideoGameBunny, an 8B-parameter, LLaVA-style model fine-tuned on a game-focused instruction dataset built atop Bunny, using a combination of image captions, image-to-JSON, and QA data to ground visual content in structured textual form. An empirical study shows that carefully curated, domain-specific data, especially image-to-JSON, enables a smaller model to outperform a much larger SOTA open-source model on game understanding tasks, with VideoGameBunny achieving 85.1% accuracy compared to 83.9% for LLaVA-1.6-34b. The results emphasize the value of high-quality, domain-focused instruction data and data-mixture strategies, and the work provides replication materials for further research in game-playing, commentary, and debugging with LMMs.
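
To make the image-to-JSON idea concrete, the sketch below turns a structured description of a screenshot into one instruction-response pair. It is a minimal illustration under stated assumptions, not the authors' pipeline: the field names (`player_inventory`, `environment`, `npcs`, `lighting`), the prompt wording, the file name, and the helper `make_json_instruction` are all hypothetical.

```python
import json

# Hypothetical prompt; the wording actually used for training is not given here.
PROMPT = "Describe this video game screenshot as a JSON object."


def make_json_instruction(image_path, annotation):
    """Build one image-to-JSON training sample from a structured annotation."""
    return {
        "image": image_path,
        "instruction": PROMPT,
        # The target response is the annotation serialized as JSON text.
        "response": json.dumps(annotation, sort_keys=True),
    }


# Hypothetical annotation with illustrative field names and values.
annotation = {
    "player_inventory": ["sword", "health potion"],
    "environment": "snowy mountain pass",
    "npcs": [{"role": "merchant"}],
    "lighting": "overcast daylight",
}

sample = make_json_instruction("screenshot_0001.png", annotation)
print(sample["response"])
```

Serializing the annotation as the response text means the model is trained to emit the structure itself, which is one plausible way to ground visual content in structured textual form.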

Abstract

Large multimodal models (LMMs) hold substantial promise across various domains, from personal assistance in daily tasks to sophisticated applications like medical diagnostics. However, their capabilities have limitations in the video game domain, such as challenges with scene understanding, hallucinations, and inaccurate descriptions of video game content, especially in open-source models. This paper describes the development of VideoGameBunny, a LLaVA-style model based on Bunny, specifically tailored for understanding images from video games. We release intermediate checkpoints, training logs, and an extensive dataset comprising 185,259 video game images from 413 titles, along with 389,565 image-instruction pairs that include image captions, question-answer pairs, and JSON representations of 16 elements for 136,974 images. Our experiments show that our high-quality game-related data has the potential to make a relatively small model outperform the much larger state-of-the-art model LLaVA-1.6-34b (which has more than 4x the number of parameters). Our study paves the way for future research in video game understanding on tasks such as playing, commentary, and debugging. Code and data are available at https://videogamebunny.github.io/
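
The dataset figures above can be put in proportion with a few lines of arithmetic. The sketch below uses only numbers quoted in the abstract and TL;DR. The nominal parameter counts are an assumption: 8B comes from the TL;DR, and 34B is read off the model name LLaVA-1.6-34b.

```python
# Figures quoted in the abstract.
num_images = 185_259
num_titles = 413
num_pairs = 389_565
num_json_images = 136_974

# Nominal parameter counts in billions (assumed from the TL;DR and the model name).
small_params_b = 8
large_params_b = 34

pairs_per_image = num_pairs / num_images
images_per_title = num_images / num_titles
json_coverage = num_json_images / num_images
size_ratio = large_params_b / small_params_b

print(f"instruction pairs per image: {pairs_per_image:.2f}")
print(f"images per title:            {images_per_title:.1f}")
print(f"images with JSON:            {json_coverage:.1%}")
print(f"nominal size ratio:          {size_ratio:.2f}x")
```

This works out to roughly 2.1 instruction pairs per image, about 450 images per title on average, a JSON representation for about 74% of the images, and a nominal size ratio of 4.25, consistent with the "more than 4x" claim.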

Paper Structure (22 sections, 20 figures, 6 tables)

Figures (20)

  • Figure 1: VideoGameBunny is a model specifically fine-tuned on video game content, enabling it to understand game contexts and respond to related questions more accurately.
  • Figure 2: Architecture overview of VideoGameBunny. An image input and a textual instruction are fed into the language model to produce a response. The image is passed through a separate pre-trained vision encoder and a projection layer to align the embedding space between the two models. Icons in the figure mark trainable and frozen layers, respectively.
  • Figure 3: Our dataset includes sample video game images that showcase a wide range of characters, environments, mechanics, camera viewpoints, and artistic styles. These styles vary from western to contemporary and futuristic, and from realistic to fantasy settings.
  • Figure 4: Overview of the dataset generation process.
  • Figure 5: Sample information extracted for the image-to-JSON dataset by Gemini-1.5-Pro. Each sample contains detailed information ranging from minor details to high-level descriptions, such as: (1) player inventory, (2, 3) details about the environment, (4) non-player characters, (5) the screenshot's watermark, and (6) lighting.
  • ...and 15 more figures
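
The data flow described in the Figure 2 caption (vision encoder, projection layer, language model) can be sketched with toy stand-ins. This is a minimal pure-Python illustration and not the actual model: the functions `vision_encoder`, `project`, `embed_text`, and `build_lm_input`, the dimensions, and the choice to place image tokens before the instruction tokens are all assumptions made for the example.

```python
import random


def vision_encoder(image, dim):
    """Stand-in for a pre-trained vision encoder: one feature vector per patch."""
    rng = random.Random(0)  # fixed seed so the toy features are reproducible
    return [[rng.random() for _ in range(dim)] for _ in image]


def project(features, weights):
    """Linear projection from the vision feature space to the text embedding space."""
    columns = list(zip(*weights))  # weights has shape (vision_dim, text_dim)
    return [[sum(f * w for f, w in zip(feat, col)) for col in columns]
            for feat in features]


def embed_text(tokens, dim):
    """Stand-in for the language model's token embedding table."""
    return [[float(len(tok))] * dim for tok in tokens]


def build_lm_input(image, instruction, weights):
    """Combine projected image tokens and instruction tokens into one sequence."""
    vision_dim, text_dim = len(weights), len(weights[0])
    visual_tokens = project(vision_encoder(image, vision_dim), weights)
    text_tokens = embed_text(instruction.split(), text_dim)
    # Assumed ordering: image tokens first, then the instruction.
    return visual_tokens + text_tokens


image = ["patch0", "patch1", "patch2"]   # three dummy image patches
weights = [[0.1] * 6 for _ in range(4)]  # 4-dim vision features -> 6-dim text space
sequence = build_lm_input(image, "What is the player holding?", weights)
print(len(sequence), len(sequence[0]))
```

After projection, every element of the sequence has the same width, which is what lets a language model consume image and text tokens together.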