The Aftermath of DrawEduMath: Vision Language Models Underperform with Struggling Students and Misdiagnose Errors

Li Lucy, Albert Zhang, Nathan Anderson, Ryan Knight, Kyle Lo

TL;DR

An extensive, year-long snapshot of how 11 vision-language models (VLMs) perform on DrawEduMath, a QA benchmark involving real students' handwritten, hand-drawn responses to math problems, finds that models' weaknesses concentrate on a core component of math education: student error.

Abstract

Effective mathematics education requires identifying and responding to students' mistakes. For AI to support pedagogical applications, models must perform well across different levels of student proficiency. Our work provides an extensive, year-long snapshot of how 11 vision-language models (VLMs) perform on DrawEduMath, a QA benchmark involving real students' handwritten, hand-drawn responses to math problems. We find that models' weaknesses concentrate on a core component of math education: student error. All evaluated VLMs underperform when describing work from students who require more pedagogical help, and across all QA, they struggle the most on questions related to assessing student error. Thus, while VLMs may be optimized to be math problem-solving experts, our results suggest that they require alternative development incentives to adequately support educational use cases.

Paper Structure (36 sections, 1 equation, 10 figures, 3 tables)

Figures (10)

  • Figure 1: On the left is a math problem, where students are asked to draw $x < 5/2$ on a number line. The right side shows two example student responses that differ in correctness. DrawEduMath pairs each math problem with one student response, and prompts VLMs to answer questions about the student response.
  • Figure 2: VLMs consistently perform worse on answering DrawEduMath benchmark questions pertaining to erroneous student responses. Performance on non-erroneous student responses ($\bigcirc$) is labeled with specific VLMs' names; that same model's performance on erroneous student responses is directly below ($\square$). Error bars are 95% CI.
  • Figure 3: Content description QA consistently drives the gap in VLM performance between student responses that contain errors versus those that do not. Appendix \ref{sec:disagg_results} includes additional VLMs that expand this finding.
  • Figure 4: An example of how a student response image (top) is transformed and cleaned up by our digital redrawing process (bottom). This student uses a place value chart to show how digit values change for 345 after division by 100.
  • Figure 5: Models' performance for content description QA generally improves after images are redrawn. Error bars are 95% CI.
  • ...and 5 more figures