DrawingBench: Evaluating Spatial Reasoning and UI Interaction Capabilities of Large Language Models through Mouse-Based Drawing Tasks

Hyunjun Kim; Sooyoung Ryu

TL;DR

DrawingBench introduces a verifiable framework to evaluate spatial reasoning in agentic LLMs by requiring sequences of mouse-based GUI actions to draw on a canvas. The benchmark combines 250 tasks across 20 categories and four difficulty levels with eight objective criteria and four error types, enabling transparent, rule-based scoring and action-level auditability. A two-turn, externally guided feedback protocol demonstrates that structured oversight improves performance (average +3.2%, up to +32.8% in complex scenes) and reduces variance, while specification clarity often matters more than task complexity. Across 1,000 tests from four leading LLMs, the results reveal strong baseline spatial reasoning in text-only settings and highlight remaining challenges in tool-state management and long-horizon planning, motivating future vision-based extensions and broader task coverage. The work provides an open-source, reproducible template for trustworthy agent assessment in interactive, spatially grounded tasks.

Abstract

As agentic AI systems increasingly operate autonomously, establishing trust through verifiable evaluation becomes critical. Yet existing benchmarks lack the transparency and auditability needed to assess whether agents behave reliably. We present DrawingBench, a verification framework for evaluating the trustworthiness of agentic LLMs through spatial reasoning tasks that require generating sequences of low-level GUI actions. Unlike opaque evaluations, DrawingBench provides transparent, rule-based assessment: 8 objective criteria enable reproducible scoring, while action-level inspection allows stakeholders to audit agent behavior. Our framework comprises 250 diverse prompts across 20 categories and 4 difficulty levels, deterministic evaluation metrics, and an external oversight mechanism through multi-turn feedback that enables human control over agent refinement. Evaluating four state-of-the-art LLMs (Claude-4 Sonnet, GPT-4.1, GPT-4.1-mini, Gemini-2.5 Flash) across 1,000 tests, we establish both capabilities and limitations: models achieved 92.8% perfect performance with structured external feedback driving significant improvements (average +3.2%, up to +32.8% for complex scenes), but systematic error patterns emerged in tool state management and long-horizon planning. Notably, specification clarity proved more important than task complexity -- models achieved 100% perfect performance when given explicit, verifiable criteria. These findings demonstrate that transparent evaluation frameworks can establish trust in agentic systems, with external oversight proving more reliable than self-correction for guiding agent behavior. Our open-source framework provides a template for trustworthy agent assessment. Code and data: https://github.com/hyunjun1121/DrawingBench
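To make the idea of rule-based, action-level assessment concrete, the following is a minimal sketch and not the DrawingBench implementation. The action names (`move`, `down`, `up`), the canvas size, the specific checks, and the helper names `check_actions` and `feedback_prompt` are all assumptions made for illustration; the actual action schema, the 8 criteria, and the 4 error types are defined in the authors' repository. The sketch shows how a sequence of low-level mouse actions could be audited deterministically and how the resulting violations could be turned into structured feedback for a second turn.

```python
from dataclasses import dataclass

# Hypothetical canvas dimensions; the real benchmark may use different values.
CANVAS_W, CANVAS_H = 1000, 700
KNOWN_KINDS = {"move", "down", "up"}  # hypothetical action vocabulary


@dataclass
class Action:
    kind: str  # one of KNOWN_KINDS
    x: int = 0
    y: int = 0


def check_actions(actions):
    """Return a list of rule violations; an empty list means every check passed."""
    errors = []
    pressed = False  # tracks whether the mouse button is currently held
    for i, a in enumerate(actions):
        if a.kind not in KNOWN_KINDS:
            errors.append(f"step {i}: unknown action {a.kind!r}")
            continue
        if not (0 <= a.x < CANVAS_W and 0 <= a.y < CANVAS_H):
            errors.append(f"step {i}: ({a.x}, {a.y}) is outside the canvas")
        if a.kind == "down":
            if pressed:
                errors.append(f"step {i}: press without releasing the previous press")
            pressed = True
        elif a.kind == "up":
            if not pressed:
                errors.append(f"step {i}: release without a prior press")
            pressed = False
    if pressed:
        errors.append("sequence ends with the mouse still pressed")
    return errors


def feedback_prompt(errors):
    """Format violations as a message that could be sent back in a second turn."""
    if not errors:
        return "All checks passed."
    return "Please fix the following issues:\n" + "\n".join(f"- {e}" for e in errors)


if __name__ == "__main__":
    attempt = [Action("move", 2000, 10), Action("up", 5, 5)]
    print(feedback_prompt(check_actions(attempt)))
```

Because every check is a deterministic function of the action list, the same sequence always yields the same verdict, and each reported violation points at a specific step that a reviewer can inspect.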
