A Benchmark for Math Misconceptions: Bridging Gaps in Middle School Algebra with AI-Supported Instruction

Otero Nancy, Druga Stefania, Lan Andrew

TL;DR

To address gaps in middle school algebra education, the paper introduces a benchmark of $55$ algebra misconceptions and common errors (MaEs) with $220$ diagnostic items to support AI-driven diagnosis of student thinking. It evaluates GPT-4-turbo with in-context learning in two experiments, reporting precision/recall of $0.526$/$0.529$ at the MaE level and $0.753$/$0.748$ under topic-constrained testing, with educator feedback raising accuracy to $83.91\%$. Around $80\%$ of the surveyed educators found the misconceptions relevant, and most expressed interest in AI-assisted diagnosis. The work highlights the value of topic-constrained testing, calls for multimodal data, and emphasizes human-in-the-loop design for practical classroom deployment.

Abstract

This study introduces an evaluation benchmark for middle school algebra to be used in artificial intelligence (AI)-based educational platforms. The goal is to support the design of AI systems that can enhance learners' conceptual understanding of algebra by taking into account their current level of algebra comprehension. The dataset comprises 55 algebra misconceptions and common errors, along with 220 diagnostic examples, all identified in previous peer-reviewed studies. We provide an example application using a large language model (LLM), observing precision and recall scores that vary with the topic and experimental setup and reach 83.9% when educator feedback is included and testing is restricted by topic. We found that topics such as ratios and proportions prove as difficult for LLMs as they are for students. We included a human assessment of the LLM's results and feedback from five middle school math educators on the clarity and occurrence of misconceptions in the dataset and on the potential use of AI in conjunction with the dataset. Most educators (80% or more) indicated that they encounter these misconceptions among their students, suggesting the relevance of the dataset to teaching middle school algebra. Despite varying familiarity with AI tools, four out of five educators expressed interest in using the dataset with AI to diagnose student misconceptions or train teachers. The results emphasize the importance of topic-constrained testing, the need for multimodal approaches, and the relevance of human expertise for gaining practical insights when using AI for human learning.
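As a rough illustration of the kind of precision and recall scoring mentioned above, the following minimal sketch computes macro-averaged precision and recall over misconception labels. It makes several assumptions: each diagnostic example has exactly one true label and one predicted label, scores are averaged equally over labels, and the identifiers (`MaE01`, `macro_precision_recall`) are hypothetical. The paper's own aggregation procedure may differ.

```python
from collections import Counter


def macro_precision_recall(true_labels, predicted_labels):
    """Macro-averaged precision and recall for single-label predictions.

    Assumption: one true and one predicted misconception label per example.
    Labels with an empty denominator contribute 0.0 to the average.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label was wrong for this example
            fn[truth] += 1  # true label was missed

    labels = set(true_labels) | set(predicted_labels)
    precisions, recalls = [], []
    for label in labels:
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precisions.append(tp[label] / p_den if p_den else 0.0)
        recalls.append(tp[label] / r_den if r_den else 0.0)

    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n


# Hypothetical label IDs, for illustration only.
truth = ["MaE01", "MaE01", "MaE02", "MaE03"]
preds = ["MaE01", "MaE02", "MaE02", "MaE01"]
precision, recall = macro_precision_recall(truth, preds)
```

Under topic-constrained testing, the same function could be applied separately to the examples of each topic, so that predictions are only compared against labels from that topic.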

Paper Structure

This paper contains 19 sections, 16 figures.

Figures (16)

  • Figure 1: Example of a training set
  • Figure 2: Example of a test set
  • Figure 3: Results for precision and recall per MaE
  • Figure 4: Results for precision and recall per MaE per topic
  • Figure 5: Scores by Experiment and Topic
  • ...and 11 more figures