Cultivating Game Sense for Yourself: Making VLMs Gaming Experts
Wenxuan Lu, Jiangyang He, Zhanqiu Zhang, Yiwen Guo, Tianning Zang
TL;DR
This work introduces GameSense, a paradigm that repurposes vision-language models (VLMs) as developers of task-specific execution modules, called game sense modules (GSMs), rather than as direct game controllers. By coupling a High-Level VLM Agent with RL-based and Rule-based GSMs, GameSense achieves real-time, API-free gameplay across ACT, FPS, and Flappy Bird, without pausing the game for reasoning. The framework combines vision-based environment analysis, memory, and three reflection modules to continually refine GSM behavior via learning or rule-based loops, supported by a standard toolset that includes a State Reader and vision processors. Empirical results show that GameSense outperforms prior VLM-driven agents in both single-task combat and complete game flow, demonstrating fluent gameplay in diverse genres and highlighting the tradeoffs between RL-based adaptability and rule-based speed. Limitations include fixed GSM types per game and challenges in fully autonomous GSM discovery, suggesting directions toward scalable GSM generation and reuse.
Abstract
Developing agents capable of fluid gameplay in first/third-person games without API access remains a critical challenge in Artificial General Intelligence (AGI). Recent efforts leverage Vision Language Models (VLMs) as direct controllers, frequently pausing the game to analyze screens and plan actions through language reasoning. However, this inefficient paradigm fundamentally restricts agents to basic and non-fluent interactions: relying on isolated VLM reasoning for each action makes it impossible to handle tasks requiring high reactivity (e.g., FPS shooting) or dynamic adaptability (e.g., ACT combat). To address this, we propose a paradigm shift in gameplay agent design: instead of directly controlling gameplay, the VLM develops specialized execution modules tailored for tasks like shooting and combat. These modules handle real-time game interactions, elevating the VLM to the role of a high-level developer. Building upon this paradigm, we introduce GameSense, a gameplay agent framework where the VLM develops task-specific game sense modules by observing task execution and leveraging vision tools and neural network training pipelines. These modules encapsulate action-feedback logic, ranging from direct action rules to neural network-based decisions. Experiments demonstrate that our framework is the first to achieve fluent gameplay in diverse genres, including ACT, FPS, and Flappy Bird, setting a new benchmark for game-playing agents.
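
To make the developer/executor split concrete, the following is a minimal sketch of what a rule-based game sense module for Flappy Bird might look like. It is not the paper's implementation. The names `RuleBasedGSM` and `develop_gsm`, the state keys `bird_y` and `gap_center_y`, and the margin heuristic are all assumptions introduced for illustration, and `develop_gsm` is a hand-written stand-in for the revision a VLM would propose after observing failed runs.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical per-frame state, e.g. as produced by a vision-based state reader.
State = Dict[str, float]


@dataclass
class RuleBasedGSM:
    """Illustrative rule-based module: a cheap state -> action rule run every frame."""

    margin: float  # tolerance (in pixels) below the gap center before flapping

    def act(self, state: State) -> str:
        # Screen y grows downward, so a larger y means the bird is lower.
        if state["bird_y"] > state["gap_center_y"] + self.margin:
            return "flap"
        return "noop"


def develop_gsm(crash_log: List[State], margin: float = 20.0) -> RuleBasedGSM:
    """Stand-in for the high-level developer (assumed heuristic, not the paper's).

    If most recorded crashes happened below the gap center, tighten the margin
    so the module flaps earlier in the next round of play.
    """
    low_crashes = sum(1 for s in crash_log if s["bird_y"] > s["gap_center_y"])
    if low_crashes > len(crash_log) / 2:
        margin = max(0.0, margin - 10.0)
    return RuleBasedGSM(margin=margin)
```

In this sketch the slow component (`develop_gsm`) runs only between episodes, while `act` is a constant-time rule that can be called on every frame without pausing the game. A neural-network-based module would expose the same `act` interface but replace the rule with a trained policy.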
