EcoScratch: Cost-Effective Multimodal Repair for Scratch Using Execution Feedback

Yuan Si, Ming Wang, Daming Li, Hanyuan Shi, Jialu Zhang

Abstract

Scratch is the most popular programming environment for novices, with over 1.15 billion projects created worldwide. Unlike traditional languages, correctness in Scratch is defined by visible behavior on the stage rather than by code structure alone, so programs that appear correct in the workspace can still fail at runtime due to timing, event ordering, or cross-sprite interactions. Visual execution evidence such as gameplay videos can therefore be essential for diagnosis and repair. However, capturing and processing this evidence inside an automated repair loop introduces substantial overhead. Probing execution, recording stage behavior, rebuilding executable .sb3 projects, and verifying candidate fixes consume time, money, and resources across an entire repair trajectory rather than a single model call. We present EcoScratch, a repair pipeline that uses lightweight runtime signals to decide whether the next attempt stays text-only or escalates to multimodal prompting. The controller also sets the JSON Patch budget and verification effort, so evidence choice and repair budget are coupled inside the same decision. EcoScratch rebuilds candidate fixes into executable .sb3 projects and records per-trajectory traces, monetary cost, and local-runtime energy. We evaluate 12 models on 100 executable Scratch repair projects under four controller settings, yielding 4,800 repair trajectories. In this matrix, a selective multimodal policy gives the strongest observed success-cost-energy tradeoff. It reaches the highest generation success (30.3%) while using less average cost and local-runtime energy than the two non-adaptive multimodal baselines under the same bounded trajectory budget; text-only remains the lowest-cost floor. Across the evaluated matrix, multimodal evidence helps most when it is used to control escalation within a bounded trajectory budget rather than applied uniformly.
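
The coupling between evidence choice and repair budget described in the abstract can be illustrated with a minimal sketch. It is not EcoScratch's actual controller: the signal fields, the escalation rule, the budget values, and the names `RuntimeSignals`, `RepairPlan`, and `choose_plan` are all hypothetical, chosen only to show one decision returning modality, patch budget, and verification effort together.

```python
from dataclasses import dataclass

# Size of the evaluation matrix stated in the abstract:
# 12 models x 100 projects x 4 controller settings.
TOTAL_TRAJECTORIES = 12 * 100 * 4


@dataclass(frozen=True)
class RuntimeSignals:
    # Hypothetical coarse signals; the abstract does not name the real ones.
    runtime_error: bool          # did the probe surface an explicit error?
    prior_failed_attempts: int   # attempts already spent on this trajectory


@dataclass(frozen=True)
class RepairPlan:
    multimodal: bool         # attach stage images to the next prompt?
    patch_budget: int        # assumed cap on JSON Patch operations
    verification_level: str  # assumed labels: "light" or "full"


def choose_plan(signals: RuntimeSignals) -> RepairPlan:
    """Pick modality, patch budget, and verification effort in one decision."""
    # Assumed rule: escalate when a text-only attempt has already failed, or
    # when no explicit error surfaced, so the bug may only show on the stage.
    needs_visual = signals.prior_failed_attempts >= 1 or not signals.runtime_error
    if needs_visual:
        return RepairPlan(multimodal=True, patch_budget=8, verification_level="full")
    return RepairPlan(multimodal=False, patch_budget=4, verification_level="light")
```

Under these assumptions, a first attempt on a project with an explicit runtime error stays text-only with the smaller budget, while a silent misbehavior or a retry escalates to multimodal prompting with a larger budget and fuller verification.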

Paper Structure

This paper contains 21 sections, 4 figures, 7 tables, and 1 algorithm.

Figures (4)

  • Figure 1: A Scratch project in the standard editor view. The scripting area exposes block structure, while the stage shows the behavior that ultimately determines whether the project works. The figure illustrates the mismatch studied in this paper: a code-local edit can look plausible in the scripting workspace yet still leave the visible runtime behavior wrong.
  • Figure 2: EcoScratch's end-to-end repair pipeline. Left: the runtime probe executes the buggy project, computes coarse signals, and may capture stage images. Middle: under the selected controller mode in Table \ref{tab:mode-glossary}, the system maps those signals to a repair plan, builds a bounded JSON-patch request, applies the returned patch, and runs staged verification with retry/refine feedback. Right: the pipeline exports outcome labels and trajectory-level trace, cost, and local-runtime energy records. All four controller modes share the same runtime; only evidence policy and repair-plan selection differ.
  • Figure 3: Model-level aggregates for all 48 model--mode cells in the $12 \times 4$ matrix. Each point is one concrete model aggregated over 100 project trajectories. Panel titles encode controller mode; the legend encodes provider family, with blue-outlined hollow points for OpenAI models and orange-outlined hollow points for Gemini models. Panel (a) plots generation success against average cost per project on a log-scaled x-axis, and panel (b) plots the same success measure against average local-runtime energy.
  • Figure 4: Per-model deltas of Heuristic relative to the two non-adaptive multimodal baselines. Each point denotes one concrete model aggregated over 100 project trajectories; the legend encodes provider family, with blue-outlined hollow points for OpenAI models and orange-outlined hollow points for Gemini models. Panel (a) uses Always-on multimodal as the baseline and panel (b) uses Fixed multimodal. The x-axis is cost reduction (%), the y-axis is generation-success gain (pp), and all points lie in the positive-positive quadrant.
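
The two axes of Figure 4 can be computed per model as in the sketch below. The function name `heuristic_deltas` and the dictionary keys are assumptions made for illustration, and the numbers in the usage note are placeholders, not results from the paper.

```python
def heuristic_deltas(heuristic: dict, baseline: dict) -> tuple[float, float]:
    """Return (cost reduction in %, generation-success gain in pp).

    Each argument is one model's aggregate over its project trajectories,
    with hypothetical keys: "cost" (average cost per project) and
    "success" (generation success, in percent).
    """
    # Relative saving against the baseline; positive means Heuristic is cheaper.
    cost_reduction_pct = 100.0 * (baseline["cost"] - heuristic["cost"]) / baseline["cost"]
    # Difference of two percentages, expressed in percentage points.
    success_gain_pp = heuristic["success"] - baseline["success"]
    return cost_reduction_pct, success_gain_pp
```

For example, with placeholder values `{"cost": 0.4, "success": 30.0}` for Heuristic and `{"cost": 0.5, "success": 25.0}` for a baseline, the function returns roughly a 20% cost reduction and a 5 pp success gain, a point in the positive-positive quadrant.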