
ZEBRAARENA: A Diagnostic Simulation Environment for Studying Reasoning-Action Coupling in Tool-Augmented LLMs

Wanjia Zhao, Ludwig Schmidt, James Zou, Vidhisha Balachandran, Lingjiao Chen

Abstract

Tool-augmented large language models (LLMs) must tightly couple multi-step reasoning with external actions, yet existing benchmarks often confound this interplay with complex environment dynamics, memorized knowledge, or dataset contamination. In this paper, we introduce ZebraArena, a procedurally generated diagnostic environment for studying reasoning-action coupling in tool-augmented LLMs. It offers controllable difficulty and a knowledge-minimal design that limits gains from memorization or dataset contamination. Each task in ZebraArena requires a set of critical information that is available only through targeted tool use, yielding an interpretable interface between external information acquisition and deductive reasoning. This design provides deterministic evaluation via unique solutions and a theoretical optimal query count for measuring efficient tool use. We show that ZebraArena requires a combination of in-depth reasoning and accurate external tool calling, which remains a challenge: frontier reasoning models such as GPT-5 and Gemini 2.5 Pro achieve only 60% accuracy on the hard instances. We also observe a persistent gap between theoretical optimality and practical tool usage. For example, GPT-5 uses 70-270% more tool calls than the theoretical optimum. We highlight the key findings of our evaluation and hope ZebraArena stimulates further research on the interplay between internal reasoning and external action.

Paper Structure (47 sections, 12 equations, 3 figures, 17 tables)

Figures (3)

  • Figure 1: This ZebraArena example has 3 houses $(N=3)$, 3 attributes $(M=3)$, and 5 clues $(K=5)$, with one withheld as a missing clue. The Background defines attributes and uniqueness constraints, and the Given Clues provide visible constraints. The agent solves the puzzle by reasoning over the given clues and, when needed, querying the ToolBox to retrieve missing information from the environment. The final assignment is shown in the Solution grid.
  • Figure 2: Overall accuracy and hierarchical tool-use diagnostics on ZebraArena. Left: accuracy for each model under three Tool Environment Types (Normal / Only-Fact / Only-Relation). Right: Interaction counts aligned with our 4-level evaluation metrics: Steps, Total Queries, Valid Queries, Effective Queries. Gaps between bars isolate failures at each stage, with the red dashed line marking the optimal lower bound $K^\star$.
  • Figure 3: Scaling behavior of tool-augmented reasoning on ZebraArena. Left: Interaction cost as task complexity increases, measured by total steps and total tool calls, shown as a function of search-space size (top) and number of missing clues $K^\star$ (bottom). Middle: Accuracy under the same scaling axes. Right: tool-use efficiency (inefficiency ratio and effectiveness rate), highlighting the widening gap between optimal and realized tool use as uncertainty grows.
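
The quantities named in the captions above can be illustrated with a minimal Python sketch. It rests on assumptions that the captions do not confirm: the formulas for the inefficiency ratio and effectiveness rate are plausible readings, not the paper's definitions; the search-space size is taken to be $(N!)^M$, the usual count for Zebra-style grids where each attribute is a permutation over houses; and all names (`EpisodeCounts`, `stage_gaps`, and so on) and the example numbers are hypothetical.

```python
from dataclasses import dataclass
from math import factorial


@dataclass
class EpisodeCounts:
    """Hypothetical per-episode counts mirroring the four levels in Figure 2."""
    steps: int              # all interaction steps
    total_queries: int      # tool calls issued
    valid_queries: int      # tool calls the environment accepted
    effective_queries: int  # tool calls that returned needed information
    k_star: int             # optimal lower bound K* (number of missing clues)


def search_space_size(n_houses: int, n_attributes: int) -> int:
    # Assumption: each attribute is an independent permutation over houses.
    return factorial(n_houses) ** n_attributes


def stage_gaps(c: EpisodeCounts) -> dict:
    # Differences between adjacent levels: where interaction budget is lost.
    return {
        "non_query_steps": c.steps - c.total_queries,
        "invalid_queries": c.total_queries - c.valid_queries,
        "ineffective_queries": c.valid_queries - c.effective_queries,
        "above_optimum": c.effective_queries - c.k_star,
    }


def overhead_pct(c: EpisodeCounts) -> float:
    # Assumed reading of "X% more tool calls than the theoretical optimum".
    return 100.0 * (c.total_queries - c.k_star) / c.k_star


def effectiveness_rate(c: EpisodeCounts) -> float:
    # Assumed definition: share of issued tool calls that were effective.
    return c.effective_queries / c.total_queries if c.total_queries else 0.0


# Made-up example episode, not a result from the paper.
example = EpisodeCounts(steps=12, total_queries=10, valid_queries=8,
                        effective_queries=5, k_star=4)
print(search_space_size(3, 3))     # size for the Figure 1 setting N=3, M=3
print(stage_gaps(example))
print(overhead_pct(example), effectiveness_rate(example))
```

Under these assumed definitions, an episode that issues 10 tool calls against an optimum of $K^\star = 4$ uses 150% more calls than necessary, which falls within the 70-270% range the abstract reports for GPT-5.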