Atari-GPT: Benchmarking Multimodal Large Language Models as Low-Level Policies in Atari Games

Nicholas R. Waytowich, Devin White, MD Sunbeam, Vinicius G. Goecks

TL;DR

This work introduces Atari-GPT, a benchmark that evaluates multimodal LLMs as low-level policies in Atari games. It tests GPT-4V Turbo, GPT-4o, Gemini 1.5 Flash, and Claude 3 Haiku on seven Atari environments, using a JSON action protocol and frame-skipping, and assesses gameplay, visual understanding, and spatial reasoning. The results show that current multimodal LLMs are not yet capable zero-shot low-level policies and trail both RL agents and human players, in part because of limited visual and spatial reasoning and inference latency. The benchmark gives the community a structured framework and metrics for tracking progress as multimodal LLMs move from high-level planning toward real-time, image-grounded low-level control.
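
The JSON action protocol and frame-skipping mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the `"action"` key, the four-action vocabulary, the frame-skip value of 8, the no-op fallback, and the `step_fn` callback are all assumptions made for illustration.

```python
import json
from typing import Callable, Tuple

NUM_ACTIONS = 4   # hypothetical action count; the real set depends on the game
FRAME_SKIP = 8    # assumed value, not taken from the paper


def parse_action(reply: str, num_actions: int = NUM_ACTIONS, fallback: int = 0) -> int:
    """Extract an integer action from a JSON reply such as '{"action": 2}'.

    Returns `fallback` (assumed here to be a no-op) when the reply is
    malformed, has no integer "action" field, or the value is out of range.
    """
    try:
        payload = json.loads(reply)
    except json.JSONDecodeError:
        return fallback
    if not isinstance(payload, dict):
        return fallback
    action = payload.get("action")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(action, bool) or not isinstance(action, int):
        return fallback
    return action if 0 <= action < num_actions else fallback


def step_with_frame_skip(
    step_fn: Callable[[int], Tuple[float, bool]],
    action: int,
    frame_skip: int = FRAME_SKIP,
) -> Tuple[float, bool]:
    """Repeat one action for up to `frame_skip` frames and sum the rewards.

    `step_fn` is a hypothetical stand-in for an environment step that
    returns (reward, done) for a single frame.
    """
    total_reward = 0.0
    done = False
    for _ in range(frame_skip):
        reward, done = step_fn(action)
        total_reward += reward
        if done:
            break
    return total_reward, done
```

Repeating each chosen action over several frames means the model is queried once per skipped block rather than once per frame, which is one way to reduce the effect of model latency on gameplay.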

Abstract

Recent advancements in large language models (LLMs) have expanded their capabilities beyond traditional text-based tasks to multimodal domains, integrating visual, auditory, and textual data. While multimodal LLMs have been extensively explored for high-level planning in domains like robotics and games, their potential as low-level controllers remains largely untapped. In this paper, we introduce a novel benchmark aimed at testing the emergent capabilities of multimodal LLMs as low-level policies in Atari games. Unlike traditional reinforcement learning (RL) methods, which require training for each new environment and a specified reward function, these LLMs use pre-existing multimodal knowledge to engage directly with game environments. Our study assesses the performance of multiple multimodal LLMs against traditional RL agents, human players, and random agents, focusing on their ability to understand and interact with complex visual scenes and formulate strategic responses. Our results show that these multimodal LLMs are not yet capable of being zero-shot low-level policies. Furthermore, we see that this is due, in part, to limitations in their visual and spatial reasoning. Additional results and videos are available on our project webpage: https://dev1nw.github.io/atari-gpt/.

Paper Structure (19 sections, 7 figures, 2 tables)

Figures (7)

  • Figure 1: Atari-GPT system diagram. It illustrates the integration of a multimodal large language model (LLM) as a low-level agent within the Atari gaming environment, showing the flow of inputs from the game to the LLM and back as the model processes game observations and generates corresponding actions. The diagram also includes the framework for human evaluation, which assesses the LLM's capabilities in visual understanding, spatial reasoning, strategic intuition, and environment recognition through a structured Q&A process.
  • Figure 2: Images used in Understanding tasks
  • Figure 3: Normalized Average Reward for GPT-4V Turbo, GPT-4o, and Gemini 1.5 Flash.
  • Figure 4: Average Human Normalized reward for each environment.
  • Figure 5: Visual, spatial, strategic, and identification results. Percentages are averaged over 2 runs.
  • ...and 2 more figures
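
The human-normalized reward named in the Figure 4 caption is commonly computed in the Atari literature by rescaling a raw score so that a random agent maps to 0 and a human player maps to 1. The sketch below assumes that convention; the paper's exact formula is not given in this excerpt, and the function and parameter names are illustrative.

```python
def human_normalized_score(agent: float, random_baseline: float, human: float) -> float:
    """Rescale a raw game score so that random play = 0.0 and human play = 1.0.

    Assumes the common convention (agent - random) / (human - random);
    this is not confirmed as the formula used for Figure 4.
    """
    if human == random_baseline:
        raise ValueError("human and random baselines must differ")
    return (agent - random_baseline) / (human - random_baseline)
```

Under this convention, values below 0 mean the agent scored worse than random play, and values above 1 mean it exceeded the human baseline.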