BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games
Davide Paglieri, Bartłomiej Cupiał, Samuel Coward, Ulyana Piterbarg, Maciej Wolczyk, Akbir Khan, Eduardo Pignatelli, Łukasz Kuciński, Lerrel Pinto, Rob Fergus, Jakob Nicolaus Foerster, Jack Parker-Holder, Tim Rocktäschel
TL;DR
<p>BALROG addresses the need for rigorous, long-horizon evaluation of agentic capabilities in LLMs and VLMs. It introduces a diverse benchmark of six challenging, procedurally generated game environments (BabyAI, Crafter, TextWorld, Baba Is AI, MiniHack, NetHack) with fine-grained 0–100 scoring, together with an open toolkit for zero-shot and inference-time prompting evaluation. The framework decouples prompting strategies from models, enabling rapid prototyping and fair comparisons, and introduces a data-informed NetHack progression metric that better reflects true task advancement. Across multiple state-of-the-art models and both language-only and vision-language modalities, BALROG shows that current systems achieve partial success on the easier tasks but struggle severely on hard, long-horizon tasks, with vision-based decision-making proving particularly challenging. The paper also discusses open research directions, including in-context learning (ICL), advanced reasoning, VLM limitations, and computational bottlenecks, and provides an open-source platform to spur future progress toward autonomous agents.</p>
Abstract
Large Language Models (LLMs) and Vision Language Models (VLMs) possess extensive knowledge and exhibit promising reasoning abilities; however, they still struggle to perform well in complex, dynamic environments. Real-world tasks require handling intricate interactions, advanced spatial reasoning, long-term planning, and continuous exploration of new strategies, all areas in which we lack effective methodologies for comprehensive evaluation. To address this gap, we introduce BALROG, a novel benchmark designed to assess the agentic capabilities of LLMs and VLMs through a diverse set of challenging games. Our benchmark incorporates a range of existing reinforcement learning environments with varying levels of difficulty, ranging from tasks that non-expert humans can solve in seconds to extremely challenging ones that may take years to master (e.g., the NetHack Learning Environment). We devise fine-grained metrics to measure performance and conduct an extensive evaluation of several popular open-source and closed-source LLMs and VLMs. Our findings indicate that while current models achieve partial success in the easier games, they struggle significantly with more challenging tasks. Notably, we observe severe deficiencies in vision-based decision-making, as several models perform worse when visual representations of the environments are provided. We release BALROG as an open and user-friendly benchmark to facilitate future research and development in the agentic community. Code and Leaderboard at balrogai.com.
