BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games

Davide Paglieri, Bartłomiej Cupiał, Samuel Coward, Ulyana Piterbarg, Maciej Wolczyk, Akbir Khan, Eduardo Pignatelli, Łukasz Kuciński, Lerrel Pinto, Rob Fergus, Jakob Nicolaus Foerster, Jack Parker-Holder, Tim Rocktäschel

TL;DR

BALROG addresses the need for rigorous, long-horizon evaluation of agentic capabilities in LLMs and VLMs. It introduces a diverse benchmark of six challenging, procedurally generated games (BabyAI, Crafter, TextWorld, Baba Is AI, MiniHack, NetHack) with fine-grained 0–100 scoring, together with an open toolkit for evaluating zero-shot and inference-time prompting. The framework decouples prompting strategies from models, which enables rapid prototyping and fair comparisons, and it introduces a data-informed NetHack progression metric that better reflects true task advancement. Across multiple state-of-the-art models, in both language-only and vision-language modalities, BALROG shows that current systems achieve partial success on the easier games but struggle significantly on hard, long-horizon tasks, with vision-based decision-making proving particularly challenging. The paper also discusses open research directions, including in-context learning (ICL), advanced reasoning, VLM limitations, and computational bottlenecks, and provides an open-source platform to spur future progress toward autonomous agents.

Abstract

Large Language Models (LLMs) and Vision Language Models (VLMs) possess extensive knowledge and exhibit promising reasoning abilities; however, they still struggle to perform well in complex, dynamic environments. Real-world tasks require handling intricate interactions, advanced spatial reasoning, long-term planning, and continuous exploration of new strategies, yet we lack effective methodologies for comprehensively evaluating these capabilities. To address this gap, we introduce BALROG, a novel benchmark designed to assess the agentic capabilities of LLMs and VLMs through a diverse set of challenging games. Our benchmark incorporates a range of existing reinforcement learning environments of varying difficulty, from tasks that non-expert humans can solve in seconds to extremely challenging ones that may take years to master (e.g., the NetHack Learning Environment). We devise fine-grained metrics to measure performance and conduct an extensive evaluation of several popular open-source and closed-source LLMs and VLMs. Our findings indicate that while current models achieve partial success in the easier games, they struggle significantly with more challenging tasks. Notably, we observe severe deficiencies in vision-based decision-making, as several models perform worse when visual representations of the environments are provided. We release BALROG as an open and user-friendly benchmark to facilitate future research and development in the agentic community. Code and leaderboard are available at balrogai.com.

Paper Structure

This paper contains 45 sections, 11 figures, 16 tables.

Figures (11)

  • Figure 1: An overview of the BALROG Benchmark for evaluating LLMs on long-context interactive tasks. Submissions of new inference-time methods for improving the capabilities of an existing model via an "agentic strategy" need only modify the agent.py file. Similarly, benchmarking a new model zero-shot can be done by adjusting a configuration file in client.py. The agent class includes a prompt builder to manage observation history, and a client that abstracts the complexities of various APIs and model-serving frameworks. The env_wrapper.py file standardizes interaction across settings, and the evaluator executes agents and collects performance metrics.
  • Figure 2: Baselines for BALROG. We evaluate the zero-shot performance of seven state-of-the-art and long-context LLMs and VLMs on BALROG. During each timestep of interaction, models are prompted to output the next in-game action conditioned on past interaction history. Standard error is obtained by running multiple replicate seeds, as detailed in the Appendix.
  • Figure 3: Examples of unique procedurally generated maps in Crafter.
  • Figure 4: Example of a 3D scene visualization in Crafter with Minecraft 3D models and textures.
  • Figure 5: TextWorld interface along with visualization.
  • ...and 6 more figures
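
The separation of concerns described in the Figure 1 caption (a prompt builder that manages observation history, a client that hides the model API, an environment wrapper, and an evaluator) can be illustrated with a minimal Python sketch. Every class and function name below is hypothetical and chosen for illustration only; this is not BALROG's actual code or API. The toy corridor environment, the scripted client that stands in for a real model, and the 0–100 progress value are likewise assumptions. The helper for the mean and standard error across runs uses the textbook formula (sample standard deviation divided by the square root of the number of runs), which may differ from the procedure detailed in the paper's Appendix.

```python
import math
import statistics
from collections import deque


class PromptBuilder:
    """Hypothetical prompt builder: keeps a bounded window of past events."""

    def __init__(self, max_history=4):
        # Old events fall off automatically once the window is full.
        self.events = deque(maxlen=max_history)

    def record(self, text):
        self.events.append(text)

    def build(self):
        return "\n".join(self.events) + "\nNext action:"


class ScriptedClient:
    """Stand-in for a model client; a real one would call an LLM or VLM API."""

    def __init__(self, reply="forward"):
        self.reply = reply

    def complete(self, prompt):
        # Ignores the prompt and returns a fixed action (assumption for the demo).
        return self.reply


class NaiveAgent:
    """Zero-shot strategy: ask the client for the next action at every step."""

    def __init__(self, client, prompt_builder):
        self.client = client
        self.prompt_builder = prompt_builder

    def act(self, observation):
        self.prompt_builder.record(f"Observation: {observation}")
        action = self.client.complete(self.prompt_builder.build()).strip()
        self.prompt_builder.record(f"Action: {action}")
        return action


class CorridorEnv:
    """Toy text environment: reach the end of a corridor by moving forward."""

    def __init__(self, length=3):
        self.length = length
        self.position = 0

    def reset(self):
        self.position = 0
        return f"You are at position 0 of {self.length}."

    def step(self, action):
        if action == "forward":
            self.position += 1
        done = self.position >= self.length
        progress = 100.0 * self.position / self.length  # 0-100 progress
        observation = f"You are at position {self.position} of {self.length}."
        return observation, progress, done


def evaluate(agent, env, max_steps=10):
    """Run one episode and return the final progress on a 0-100 scale."""
    observation = env.reset()
    progress = 0.0
    for _ in range(max_steps):
        action = agent.act(observation)
        observation, progress, done = env.step(action)
        if done:
            break
    return progress


def mean_and_stderr(scores):
    """Mean and standard error over runs (needs at least two scores)."""
    mean = statistics.mean(scores)
    stderr = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, stderr


if __name__ == "__main__":
    agent = NaiveAgent(ScriptedClient("forward"), PromptBuilder())
    print(evaluate(agent, CorridorEnv(length=3)))
```

In this sketch, trying a different inference-time strategy means replacing only `NaiveAgent`, and trying a different model means replacing only the client, which mirrors the decoupling the caption describes.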