SketchMind: A Multi-Agent Cognitive Framework for Assessing Student-Drawn Scientific Sketches

Ehsan Latif, Zirak Khan, Xiaoming Zhai

TL;DR

SketchMind presents a cognitively grounded, multi-agent framework for assessing student-drawn scientific sketches by modeling sketches as Bloom-annotated Sketch Reasoning Graphs (SRGs). It decomposes the task into four agents—rubric parsing, perception, cognitive alignment, and feedback/modification—enabling transparent, formative feedback and iterative sketch improvements. Empirical results on NGSS-aligned data show that SRG supervision substantially boosts sketch-prediction accuracy across state-of-the-art models (e.g., GPT-4.1 achieving about $90.2\%$ average with SRG) and that the multi-agent approach outperforms single-agent baselines. Human experts rate the feedback and revised sketches highly when paired with strong LLMs, highlighting the framework’s potential to support conceptual growth in science education and to provide interpretable, pedagogically aligned reasoning for free-form sketches.

Abstract

Scientific sketches (e.g., models) offer a powerful lens into students' conceptual understanding, yet AI-powered automated assessment of such free-form, visually diverse artifacts remains a critical challenge. Existing solutions often treat sketch evaluation either as an image classification task or as a job for monolithic vision-language models, approaches that lack interpretability, pedagogical alignment, and adaptability across cognitive levels. To address these limitations, we present SketchMind, a cognitively grounded, multi-agent framework for evaluating and improving student-drawn scientific sketches. SketchMind comprises modular agents responsible for rubric parsing, sketch perception, cognitive alignment, and iterative feedback with sketch modification, enabling personalized and transparent evaluation. We evaluate SketchMind on a curated dataset of 3,575 student-generated sketches across six science assessment items that differ in their highest Bloom's taxonomy level and require students to draw models to explain phenomena. Relative to baseline GPT-4o performance without the Sketch Reasoning Graph (SRG) (average accuracy: 55.6%), SRG integration achieves 77.1% average accuracy (+21.4% average absolute gain). We also demonstrate that multi-agent orchestration with SRG enhances SketchMind performance: for example, GPT-4.1 gains an average of 8.9% in sketch prediction accuracy, outperforming single-agent pipelines across all items. Human evaluators rated the feedback and co-created sketches generated by SketchMind with GPT-4.1 at an average of 4.1 out of 5, significantly higher than those of baseline models (e.g., 2.3 for GPT-4o). Experts noted the system's potential to meaningfully support conceptual growth through guided revision. Our code and (pending approval) dataset will be released to support reproducibility and future research in AI-driven education.

Paper Structure

This paper contains 19 sections, 6 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Overview of SRG generation. Given a multimodal question (an image and a textual description of the question) and an expert-designed textual rubric for evaluating student sketch performance, Agent 1 processes the information, extracts the SRG components, and builds a Level 4 Bloom's-taxonomy-ordered SRG (Bloom level ) that sets the gold standard for further evaluation and sketch modification.
  • Figure 2: Sample student-drawn sketch and the perceived SRG that Agent 2 extracts from it. (a) The student's sketch; (b) Agent 2's perceived SRG for that sketch.
  • Figure 3: Cognitive alignment score and feedback for the perceived SRG (see Figure 2), generated by Agent 3 after similarity score calculations.
  • Figure 4: Sample SRG modified according to Agent 3's feedback and score, and the sketch updated with the embedded Python toolkit.
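
The four figures above trace a pipeline from a gold-standard SRG (Agent 1) through a perceived SRG (Agent 2) to an alignment score and feedback (Agent 3). The following is a minimal sketch of that comparison step, under stated assumptions: the `Node` and `SRG` classes, the Jaccard-based `alignment_score`, and the example labels are all hypothetical illustrations and are not taken from the paper, whose actual graph schema and similarity measure are not given in this summary.

```python
from dataclasses import dataclass, field

# Bloom's revised taxonomy levels, lowest to highest.
BLOOM_LEVELS = ["remember", "understand", "apply", "analyze", "evaluate", "create"]


@dataclass(frozen=True)
class Node:
    """A sketch element annotated with a Bloom level (hypothetical schema)."""
    label: str
    bloom: str


@dataclass
class SRG:
    """Toy Sketch Reasoning Graph: nodes plus directed (source, target) edges."""
    nodes: set = field(default_factory=set)
    edges: set = field(default_factory=set)


def jaccard(a, b):
    """Jaccard overlap of two sets; defined as 1.0 when both are empty."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def alignment_score(gold, perceived):
    """Illustrative score: mean of node and edge Jaccard overlap.

    This is a stand-in, not the similarity measure used by Agent 3.
    """
    node_overlap = jaccard(gold.nodes, perceived.nodes)
    edge_overlap = jaccard(gold.edges, perceived.edges)
    return 0.5 * (node_overlap + edge_overlap)


def missing_elements(gold, perceived):
    """Gold-standard nodes absent from the perceived graph, lowest Bloom level first."""
    missing = gold.nodes - perceived.nodes
    return sorted(missing, key=lambda n: (BLOOM_LEVELS.index(n.bloom), n.label))


# Hypothetical example: a rubric expecting three elements, a sketch showing two.
particles = Node("particles", "remember")
arrows = Node("motion arrows", "understand")
temperature = Node("temperature change", "analyze")

gold = SRG(
    nodes={particles, arrows, temperature},
    edges={("particles", "motion arrows"), ("motion arrows", "temperature change")},
)
perceived = SRG(
    nodes={particles, arrows},
    edges={("particles", "motion arrows")},
)

score = alignment_score(gold, perceived)
feedback_targets = [n.label for n in missing_elements(gold, perceived)]
```

Ordering the missing elements by Bloom level is one way feedback could address lower-order gaps before higher-order ones; whether SketchMind sequences its feedback this way is not stated here.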