See, Plan, Snap: Evaluating Multimodal GUI Agents in Scratch

Xingyi Zhang, Yulei Ye, Kaifeng Huang, Wenhao Li, Xiangfeng Wang

TL;DR

ScratchWorld introduces the first benchmark specifically targeting multimodal GUI agents in block-based Scratch, focusing on program-by-construction tasks. By separating reasoning from visuomotor execution through primitive and composite modes and validating solutions via execution-based tests in the Scratch VM, the paper exposes a substantial reasoning-acting gap: models plan well but struggle with precise drag-and-drop and endpoint localization. Diagnostics reveal that perception is not the primary bottleneck; instead, robust, snap-aware interaction policies are needed for reliable program construction. The benchmark combines a diverse task set, a rigorous construction pipeline, and an execution-grounded evaluation framework to advance research on AI-assisted Scratch tutoring and low-code education.

Abstract

Block-based programming environments such as Scratch play a central role in low-code education, yet evaluating the capabilities of AI agents to construct programs through Graphical User Interfaces (GUIs) remains underexplored. We introduce ScratchWorld, a benchmark for evaluating multimodal GUI agents on program-by-construction tasks in Scratch. Grounded in the Use-Modify-Create pedagogical framework, ScratchWorld comprises 83 curated tasks spanning four distinct problem categories: Create, Debug, Extend, and Compute. To rigorously diagnose the source of agent failures, the benchmark employs two complementary interaction modes: primitive mode requires fine-grained drag-and-drop manipulation to directly assess visuomotor control, while composite mode uses high-level semantic APIs to disentangle program reasoning from GUI execution. To ensure reliable assessment, we propose an execution-based evaluation protocol that validates the functional correctness of the constructed Scratch programs through runtime tests within the browser environment. Extensive experiments across state-of-the-art multimodal language models and GUI agents reveal a substantial reasoning-acting gap, highlighting persistent challenges in fine-grained GUI manipulation despite strong planning capabilities.

Paper Structure (43 sections, 4 equations, 15 figures, 8 tables)


Figures (15)

  • Figure 1: Overview of the ScratchWorld evaluation workflow, illustrated with a Debug task: fixing incorrect paddle control in Pong. Agents interact via two modes: primitive mode using GUI operations (drag-and-drop) or composite mode using high-level APIs (delete block). Execution-based evaluation uses the Scratch VM to validate functional correctness.
  • Figure 2: Example tasks for four problem categories in ScratchWorld. Create tasks require creating interactive projects from scratch, such as a balloon game. Debug tasks involve diagnosing and correcting bugs, illustrated by fixing a coordinate error in a paddle controller. Extend tasks challenge agents to extend existing functionality, e.g., adding password authentication for a circuit switch. Compute tasks focus on pure computational logic, such as implementing a factorial calculator.
  • Figure 3: JSON schema of the primitive action space $\mathcal{A}_{\text{prim}}$ in ScratchWorld. This specification defines low-level GUI primitives that agents must execute in primitive mode. The action space includes: mouse operations (click, double_click, drag_and_drop, scroll) with coordinate-based or index-based targeting; keyboard operations (type, key, hotkey); and task completion signals (done, failed).
  • Figure 4: JSON schema of the composite action space $\mathcal{A}_{\text{comp}}$ in ScratchWorld. This specification defines high-level semantic APIs that agents use in composite mode to manipulate Scratch programs without visual grounding. The action space includes: target selection (select_sprite, select_stage), block manipulation (add_block with optional variable/list creation, connect_blocks with placement strategies, set_block_field for parameter configuration, delete_block), and task completion signals.
  • Figure 5: Example of the composite mode observation. It includes environmental context (variables, lists, and targets) and the block structure (pseudocode).
  • ...and 10 more figures
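
To make the two action spaces in the Figure 3 and Figure 4 captions more concrete, here is a minimal Python sketch that groups the action names listed there and checks which mode an action belongs to. Only the action names are taken from the captions. The dict-based action format, the `"action"` key, the group labels, the `classify_action` helper, and the assumption that composite mode reuses the `done` and `failed` signals are all hypothetical and are not the benchmark's actual JSON schema.

```python
from typing import Any

# Action names follow the Figure 3 caption; the group labels are hypothetical.
PRIMITIVE_ACTIONS = {
    "mouse": {"click", "double_click", "drag_and_drop", "scroll"},
    "keyboard": {"type", "key", "hotkey"},
    "signal": {"done", "failed"},
}

# Action names follow the Figure 4 caption. The caption does not name the
# composite completion signals, so reusing done/failed is an assumption.
COMPOSITE_ACTIONS = {
    "target": {"select_sprite", "select_stage"},
    "block": {"add_block", "connect_blocks", "set_block_field", "delete_block"},
    "signal": {"done", "failed"},
}


def classify_action(action: dict[str, Any], mode: str) -> str:
    """Return the group an action belongs to in the given mode.

    Raises ValueError for an unknown mode or for an action name that is
    not part of that mode's action space.
    """
    spaces = {"primitive": PRIMITIVE_ACTIONS, "composite": COMPOSITE_ACTIONS}
    if mode not in spaces:
        raise ValueError(f"unknown mode: {mode!r}")
    name = action.get("action")  # hypothetical key holding the action name
    for group, names in spaces[mode].items():
        if name in names:
            return group
    raise ValueError(f"{name!r} is not a {mode} action")


# A primitive drag with illustrative pixel coordinates (hypothetical fields).
drag = {"action": "drag_and_drop", "start": [120, 300], "end": [480, 260]}
print(classify_action(drag, "primitive"))

# The composite counterpart edits the program without visual grounding.
print(classify_action({"action": "delete_block"}, "composite"))
```

Keeping the two spaces disjoint apart from the completion signals mirrors the split described in the captions: the same edit either goes through coordinate-level GUI operations or through a semantic call, never a mix of the two.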