SketchMind: A Multi-Agent Cognitive Framework for Assessing Student-Drawn Scientific Sketches
Ehsan Latif, Zirak Khan, Xiaoming Zhai
TL;DR
SketchMind presents a cognitively grounded, multi-agent framework for assessing student-drawn scientific sketches by modeling sketches as Bloom-annotated Sketch Reasoning Graphs (SRGs). It decomposes the task into four agents (rubric parsing, perception, cognitive alignment, and feedback/modification), enabling transparent, formative feedback and iterative sketch improvements. Empirical results on NGSS-aligned data show that SRG supervision substantially boosts sketch-prediction accuracy across state-of-the-art models (e.g., GPT-4.1 achieving about 90.2% average with SRG) and that the multi-agent approach outperforms single-agent baselines. Human experts rate the feedback and revised sketches highly when paired with strong LLMs, highlighting the framework’s potential to support conceptual growth in science education and to provide interpretable, pedagogically aligned reasoning for free-form sketches.
Abstract
Scientific sketches (e.g., models) offer a powerful lens into students' conceptual understanding, yet AI-powered automated assessment of such free-form, visually diverse artifacts remains a critical challenge. Existing solutions often treat sketch evaluation either as an image classification task or as a job for a monolithic vision-language model; both approaches lack interpretability, pedagogical alignment, and adaptability across cognitive levels. To address these limitations, we present SketchMind, a cognitively grounded, multi-agent framework for evaluating and improving student-drawn scientific sketches. SketchMind comprises modular agents responsible for rubric parsing, sketch perception, cognitive alignment, and iterative feedback with sketch modification, enabling personalized and transparent evaluation. We evaluate SketchMind on a curated dataset of 3,575 student-generated sketches across six science assessment items that differ in their highest Bloom's level and that require students to draw models to explain phenomena. Compared with baseline GPT-4o performance without Sketch Reasoning Graph (SRG) supervision (average accuracy: 55.6%), SRG integration raises average accuracy to 77.1% (+21.4% average absolute gain). We also demonstrate that multi-agent orchestration with SRG improves SketchMind's performance: for example, GPT-4.1 gains an average of 8.9% in sketch prediction accuracy, outperforming single-agent pipelines across all items. Human evaluators rated the feedback and co-created sketches generated by SketchMind with GPT-4.1 at an average of 4.1 out of 5, significantly higher than those of baseline models (e.g., 2.3 for GPT-4o). Experts noted the system's potential to meaningfully support conceptual growth through guided revision. Our code and (pending approval) dataset will be released to support reproducibility and future research in AI-driven education.
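To make the idea of a Bloom-annotated Sketch Reasoning Graph more concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes an SRG can be approximated as concept nodes tagged with a Bloom level plus directed edges between concepts, and that alignment can be approximated by counting shared nodes and edges. The names `Bloom`, `SRG`, `coverage`, and `missing_concepts`, as well as the example concepts, are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Bloom(IntEnum):
    """Bloom's taxonomy levels, ordered from lowest to highest."""
    REMEMBER = 1
    UNDERSTAND = 2
    APPLY = 3
    ANALYZE = 4
    EVALUATE = 5
    CREATE = 6


@dataclass
class SRG:
    """Hypothetical Sketch Reasoning Graph: concepts tagged with Bloom levels."""
    nodes: dict[str, Bloom] = field(default_factory=dict)
    edges: set[tuple[str, str]] = field(default_factory=set)


def coverage(reference: SRG, student: SRG) -> float:
    """Fraction of reference nodes and edges that also appear in the student graph.

    This simple overlap score is an assumption made for illustration only.
    """
    total = len(reference.nodes) + len(reference.edges)
    if total == 0:
        return 0.0
    matched_nodes = sum(1 for name in reference.nodes if name in student.nodes)
    matched_edges = len(reference.edges & student.edges)
    return (matched_nodes + matched_edges) / total


def missing_concepts(reference: SRG, student: SRG) -> list[str]:
    """Reference concepts absent from the student graph, lowest Bloom level first."""
    missing = [name for name in reference.nodes if name not in student.nodes]
    return sorted(missing, key=lambda name: (reference.nodes[name], name))


# Made-up example: a rubric-derived reference graph and a partial student graph.
reference = SRG(
    nodes={
        "particles": Bloom.REMEMBER,
        "motion": Bloom.UNDERSTAND,
        "heat transfer": Bloom.ANALYZE,
    },
    edges={("particles", "motion"), ("motion", "heat transfer")},
)
student = SRG(
    nodes={"particles": Bloom.REMEMBER, "motion": Bloom.UNDERSTAND},
    edges={("particles", "motion")},
)

print(coverage(reference, student))          # 3 of 5 reference elements matched
print(missing_concepts(reference, student))  # concepts a feedback step could target
```

In this toy setup, a rubric-parsing step would produce `reference`, a perception step would produce `student`, and the list of missing concepts is the kind of signal a feedback step could turn into a revision hint.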
