From Correctness to Comprehension: AI Agents for Personalized Error Diagnosis in Education
Yi-Fan Zhang, Hang Li, Dingjie Song, Lichao Sun, Tianlong Xu, Qingsong Wen
TL;DR
LLM-based educational tools tend to prioritize answer correctness over diagnosing why students make errors. This work introduces MathCCS, a multimodal error-analysis benchmark built from real-world problems with expert-annotated error categories and longitudinal student data, along with a sequential error-analysis framework that uses a student's history to track learning trajectories. It also proposes a two-agent framework in which a Time Series Agent models historical error patterns and an MLLM Agent refines the diagnosis in real time. On MathCCS, state-of-the-art MLLMs (Qwen2-VL, LLaVA-OV, Claude-3.5-Sonnet, GPT-4o) fall well short of human educators: none exceeds 30% classification accuracy, and average suggestion scores stay below 4/10. Combining temporal context with real-time multimodal reasoning is reported to improve both error classification and feedback generation. The approach supports adaptive, context-aware feedback grounded in longitudinal student data, and points to future work on deeper multimodal integration and scalable annotation.
Abstract
Large Language Models (LLMs), such as GPT-4, have demonstrated impressive mathematical reasoning capabilities, achieving near-perfect performance on benchmarks like GSM8K. However, their application in personalized education remains limited due to an overemphasis on correctness over error diagnosis and feedback generation. Current models fail to provide meaningful insights into the causes of student mistakes, limiting their utility in educational contexts. To address these challenges, we present three key contributions. First, we introduce \textbf{MathCCS} (Mathematical Classification and Constructive Suggestions), a multi-modal benchmark designed for systematic error analysis and tailored feedback. MathCCS includes real-world problems, expert-annotated error categories, and longitudinal student data. Evaluations of state-of-the-art models, including \textit{Qwen2-VL}, \textit{LLaVA-OV}, \textit{Claude-3.5-Sonnet} and \textit{GPT-4o}, reveal that none achieved classification accuracy above 30\% or generated high-quality suggestions (average scores below 4/10), highlighting a significant gap from human-level performance. Second, we develop a sequential error analysis framework that leverages historical data to track trends and improve diagnostic precision. Finally, we propose a multi-agent collaborative framework that combines a Time Series Agent for historical analysis and an MLLM Agent for real-time refinement, enhancing error classification and feedback generation. Together, these contributions provide a robust platform for advancing personalized education, bridging the gap between current AI capabilities and the demands of real-world teaching.
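To make the division of labor between the two agents concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes the Time Series Agent can be approximated by a recency-weighted frequency of a student's past error labels, and that the MLLM Agent returns a score per category for the current response. The category names, the function names `history_prior` and `diagnose`, and the parameters `decay` and `alpha` are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Placeholder category names (assumption); the actual MathCCS taxonomy is not listed here.
CATEGORIES = ["calculation", "concept", "misreading"]


def history_prior(history: List[str], decay: float = 0.8) -> Dict[str, float]:
    """Stand-in for a Time Series Agent: recency-weighted frequency of past error labels."""
    weights: Dict[str, float] = defaultdict(float)
    # The most recent label has age 0 and therefore the largest weight.
    for age, label in enumerate(reversed(history)):
        weights[label] += decay ** age
    # Add-one smoothing so categories never seen before keep non-zero mass.
    smoothed = {c: weights[c] + 1.0 for c in CATEGORIES}
    total = sum(smoothed.values())
    return {c: v / total for c, v in smoothed.items()}


def diagnose(
    history: List[str],
    current_scores: Dict[str, float],
    alpha: float = 0.5,
) -> Tuple[str, Dict[str, float]]:
    """Fuse the historical prior with per-category scores for the current answer.

    `current_scores` stands in for the MLLM Agent's output (assumption).
    `alpha` in [0, 1] controls how much weight the history receives.
    """
    prior = history_prior(history)
    # Weighted geometric mean of the two distributions, then renormalize.
    fused = {
        c: (prior[c] ** alpha) * (current_scores[c] ** (1.0 - alpha))
        for c in CATEGORIES
    }
    total = sum(fused.values())
    fused = {c: v / total for c, v in fused.items()}
    return max(fused, key=fused.get), fused
```

In this sketch, a student whose recent mistakes were mostly calculation errors can have an ambiguous current answer resolved toward "calculation", while setting `alpha=0` falls back to the current-answer scores alone. The geometric-mean fusion is one simple choice among many; the paper's agents may interact quite differently.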
