Atari-GPT: Benchmarking Multimodal Large Language Models as Low-Level Policies in Atari Games
Nicholas R. Waytowich, Devin White, MD Sunbeam, Vinicius G. Goecks
TL;DR
This work introduces Atari-GPT, a benchmark for evaluating multimodal LLMs as low-level policies in Atari games. It systematically tests GPT-4V Turbo, GPT-4o, Gemini 1.5 Flash, and Claude 3 Haiku on seven Atari environments, using a JSON action protocol and frame-skipping to assess gameplay, visual understanding, and spatial reasoning. The findings show that current multimodal LLMs are not yet viable zero-shot controllers and fall short of both RL and human performance, primarily because of limits in spatial reasoning and inference latency. The study offers a structured framework and metrics for quantifying progress toward real-time, image-grounded control, and establishes a baseline the community can use to track how far multimodal LLMs have moved from high-level planning toward low-level environmental interaction and decision-making.
Abstract
Recent advancements in large language models (LLMs) have expanded their capabilities beyond traditional text-based tasks to multimodal domains, integrating visual, auditory, and textual data. While multimodal LLMs have been extensively explored for high-level planning in domains like robotics and games, their potential as low-level controllers remains largely untapped. In this paper, we introduce a novel benchmark aimed at testing the emergent capabilities of multimodal LLMs as low-level policies in Atari games. Unlike traditional reinforcement learning (RL) methods, which require training for each new environment and a specified reward function, these LLMs use pre-existing multimodal knowledge to engage directly with game environments. Our study compares the performance of multiple multimodal LLMs against traditional RL agents, human players, and random agents, focusing on their ability to understand and interact with complex visual scenes and formulate strategic responses. Our results show that these multimodal LLMs are not yet capable of acting as zero-shot low-level policies, and that this is due in part to limitations in their visual and spatial reasoning. Additional results and videos are available on our project webpage: https://dev1nw.github.io/atari-gpt/.
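
The TL;DR mentions a JSON action protocol combined with frame-skipping. As a rough illustration of what such a control loop might involve, the sketch below parses a JSON reply into a discrete action and repeats that action for a fixed number of frames. It is not the authors' implementation: the action names, the reply schema {"action": ...}, the toy CountingEnv class, and the default frame_skip value of 4 are all assumptions made for the example.

```python
import json

# Hypothetical action vocabulary; a real Atari environment defines its own.
ACTIONS = {"NOOP": 0, "FIRE": 1, "RIGHT": 2, "LEFT": 3}


def parse_action(reply: str, default: int = 0) -> int:
    """Map a model reply such as '{"action": "LEFT"}' to an action id.

    Falls back to `default` (NOOP here) when the reply is not valid JSON,
    is not a JSON object, or names an unknown action.
    """
    try:
        payload = json.loads(reply)
    except json.JSONDecodeError:
        return default
    if not isinstance(payload, dict):
        return default
    name = payload.get("action")
    if not isinstance(name, str):
        return default
    return ACTIONS.get(name.upper(), default)


class CountingEnv:
    """Toy stand-in for an emulator: reward 1.0 for every frame of RIGHT."""

    def __init__(self, max_frames: int = 20):
        self.frames = 0
        self.max_frames = max_frames

    def step(self, action: int):
        self.frames += 1
        reward = 1.0 if action == ACTIONS["RIGHT"] else 0.0
        done = self.frames >= self.max_frames
        return reward, done


def step_with_frame_skip(env, action: int, frame_skip: int = 4):
    """Repeat one action for up to `frame_skip` frames, summing the reward.

    Stops early if the episode ends, so the model is queried once per
    `frame_skip` frames rather than once per frame.
    """
    total = 0.0
    done = False
    for _ in range(frame_skip):
        reward, done = env.step(action)
        total += reward
        if done:
            break
    return total, done
```

Repeating each action over several frames reduces how often the model must be queried, which matters when inference latency is a limiting factor, and the fallback to a no-op keeps the loop running when a reply cannot be parsed.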
