From Correctness to Comprehension: AI Agents for Personalized Error Diagnosis in Education

Yi-Fan Zhang, Hang Li, Dingjie Song, Lichao Sun, Tianlong Xu, Qingsong Wen

TL;DR

This work addresses a gap in AI-driven education: LLMs are optimized for answer correctness rather than for diagnosing student errors. It introduces MathCCS, a multimodal error-analysis benchmark built from real-world data with expert-annotated error categories, together with a sequential error-analysis dataset that captures learning trajectories. It also proposes a two-agent framework (Time Series Agent + MLLM Agent) that fuses historical patterns with real-time reasoning. On the benchmark, current MLLMs fall short of human educators in both error classification and actionable feedback, which motivates structured diagnostic platforms; the hybrid framework substantially improves diagnostic quality by using temporal context for error analysis and tailored recommendations. The approach supports adaptive, context-aware feedback grounded in longitudinal student data, and points to future work on deeper multimodal integration and scalable annotation to better support teaching needs.

Abstract

Large Language Models (LLMs), such as GPT-4, have demonstrated impressive mathematical reasoning capabilities, achieving near-perfect performance on benchmarks like GSM8K. However, their application in personalized education remains limited due to an overemphasis on correctness over error diagnosis and feedback generation. Current models fail to provide meaningful insights into the causes of student mistakes, limiting their utility in educational contexts. To address these challenges, we present three key contributions. First, we introduce MathCCS (Mathematical Classification and Constructive Suggestions), a multi-modal benchmark designed for systematic error analysis and tailored feedback. MathCCS includes real-world problems, expert-annotated error categories, and longitudinal student data. Evaluations of state-of-the-art models, including Qwen2-VL, LLaVA-OV, Claude-3.5-Sonnet, and GPT-4o, reveal that none achieved classification accuracy above 30% or generated high-quality suggestions (average scores below 4/10), highlighting a significant gap from human-level performance. Second, we develop a sequential error analysis framework that leverages historical data to track trends and improve diagnostic precision. Finally, we propose a multi-agent collaborative framework that combines a Time Series Agent for historical analysis and an MLLM Agent for real-time refinement, enhancing error classification and feedback generation. Together, these contributions provide a robust platform for advancing personalized education, bridging the gap between current AI capabilities and the demands of real-world teaching.

Paper Structure

This paper contains 23 sections, 2 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Each data sample in the dataset includes traditional question-answer pairs along with students' responses. Additionally, we provide students' drafts and a detailed analysis of the problems to furnish the model with more contextual information. Each student has multiple time-step data points, which support the construction of user profiles and enable the delivery of personalized recommendations. Finally, we include annotations from educational experts that identify the reasons for errors in the problems, along with relevant suggestions for improvement.
  • Figure 2: An overview of the MathCCS error categorization framework, showcasing the major error categories and their corresponding subcategories. Detailed explanations for each subcategory are provided in Table \ref{tab:defi}. The framework is designed by educational experts and comprises 9 major error categories and 29 subcategories, covering the most prevalent error types observed among elementary-grade students.
  • Figure 3: Left: Distribution of the number of data points per student in the sequential data. The minimum is 21 and the maximum is 184, reflecting a diverse and realistic set of student interaction records in educational scenarios. Right: Distribution of error categories identified by GPT-4o in the dataset. The data exhibits a pronounced long-tail pattern, highlighting the uneven frequency of different error types.
  • Figure 4: The collaborative framework leverages two agents to analyze and understand student problem-solving patterns, along with error diagnosis and recommendations. The Time Series Agent processes historical data on the student's problem-solving behavior to make initial predictions. These preliminary insights are then refined by the MLLM Agent, which employs advanced reasoning capabilities to provide detailed error classifications and context-specific recommendations for improvement. The red-highlighted interface represents the output of the time-series model, which is passed to the MLLM agent for downstream error classification and reasoning. If the MLLM's performance is evaluated on individual sample points without leveraging the temporal context, this part of the interface is not required.
  • Figure 5: The time-series model architecture consists of modality-specific encoders, an MLP mapping layer, and a pre-normalization layer as the input processing module, which aligns data from different modalities before feeding it into the time-series transformer layer.
  • ...and 1 more figure
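
The two-agent flow described in the Figure 4 caption can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Attempt` fields, the function names, the prompt wording, and the example category labels are all hypothetical, and a simple frequency count stands in for the paper's time-series transformer. The `use_history` flag mirrors the caption's note that the time-series output is not needed when the MLLM is evaluated on individual samples.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Attempt:
    """One time step in a student's record (field names are hypothetical)."""
    question: str
    student_answer: str
    error_category: Optional[str] = None  # expert label, when available


def time_series_agent(history: List[Attempt], top_k: int = 2) -> List[str]:
    """Stand-in for the Time Series Agent.

    Ranks the student's past error categories by frequency. The paper uses a
    time-series transformer; a count is used here only to stay self-contained.
    """
    counts = Counter(a.error_category for a in history if a.error_category)
    return [category for category, _ in counts.most_common(top_k)]


def build_prompt(current: Attempt, prior: Optional[List[str]]) -> str:
    """Compose the MLLM Agent's input; the prior is skipped when absent."""
    lines = [
        f"Question: {current.question}",
        f"Student answer: {current.student_answer}",
    ]
    if prior:
        lines.append("Likely error categories from history: " + ", ".join(prior))
    lines.append("Classify the error and suggest an improvement.")
    return "\n".join(lines)


def diagnose(
    history: List[Attempt],
    current: Attempt,
    mllm: Callable[[str], str],
    use_history: bool = True,
) -> str:
    """Run the preliminary prediction, then let the MLLM Agent refine it."""
    prior = time_series_agent(history) if use_history else None
    return mllm(build_prompt(current, prior))
```

Here `mllm` is any callable that maps a prompt string to a response, so a real model client or a test stub can be plugged in without changing the pipeline.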