From Handwriting to Feedback: Evaluating VLMs and LLMs for AI-Powered Assessment in Indonesian Classrooms

Nurul Aisyah; Muhammad Dehan Al Kautsar; Arif Hidayat; Raqib Chowdhury; Fajri Koto

TL;DR

This work tackles AI-powered assessment in underrepresented classrooms by building a multimodal pipeline that couples a vision-language model for OCR of handwritten student work with large language models for rubric-guided scoring and Indonesian feedback. The authors release a first-of-its-kind dataset of over 14K Grade-4 handwritten responses in Mathematics and English from six Indonesian schools, and they systematically compare several VLM/LLM configurations. They find that GPT-4o with vision input provides the closest alignment to human grading, especially for essays, while LLM-generated feedback remains pedagogically useful despite OCR noise; however, personalization and contextual relevance require further improvement. The study also analyzes urban–rural disparities and OCR error propagation, underscoring the need for localization and careful deployment of AI-assisted assessment in low-resource, multilingual education settings.

Abstract

Despite rapid progress in vision-language and large language models (VLMs and LLMs), their effectiveness for AI-driven educational assessment in real-world, underrepresented classrooms remains largely unexplored. We evaluate state-of-the-art VLMs and LLMs on over 14K handwritten answers from grade-4 classrooms in Indonesia, covering Mathematics and English aligned with the local national curriculum. Unlike prior work on clean digital text, our dataset features naturally curly, diverse handwriting from real classrooms, posing realistic visual and linguistic challenges. Assessment tasks include grading and generating personalized Indonesian feedback guided by rubric-based evaluation. Results show that the VLM struggles with handwriting recognition, and its errors propagate into LLM grading. LLM feedback nevertheless remains pedagogically useful despite imperfect visual inputs, although it shows limits in personalization and contextual relevance.
