MathFish: Evaluating Language Model Math Reasoning via Grounding in Educational Curricula

Li Lucy, Tal August, Rose E. Wang, Luca Soldaini, Courtney Allison, Kyle Lo

Abstract

To ensure that math curriculum is grade-appropriate and aligns with critical skills or concepts in accordance with educational standards, pedagogical experts can spend months carefully reviewing published math problems. Drawing inspiration from this process, our work presents a novel angle for evaluating language models' (LMs) mathematical abilities, by investigating whether they can discern skills and concepts enabled by math content. We contribute two datasets: one consisting of 385 fine-grained descriptions of K-12 math skills and concepts, or standards, from Achieve the Core (ATC), and another of 9.9K math problems labeled with these standards (MathFish). We develop two tasks for evaluating LMs' abilities to assess math problems: (1) verifying whether a problem aligns with a given standard, and (2) tagging a problem with all aligned standards. Working with experienced teachers, we find that LMs struggle to tag and verify standards linked to problems, and instead predict labels that are close to ground truth, but differ in subtle ways. We also show that LMs often generate problems that do not fully align with standards described in prompts, suggesting the need for careful scrutiny on use cases involving LMs for generating curricular materials. Finally, we categorize problems in GSM8k using math standards, allowing us to better understand why some problems are more difficult to solve for models than others.

Paper Structure (56 sections, 23 figures, 8 tables)

Figures (23)

  • Figure 1: An example of a MathFish problem, along with domains ($\mathcal{D}$), clusters ($\mathcal{C}$), and standards ($\mathcal{S}$) it does and does not align with. Solid lines indicate hierarchical relationships, while a dashed line links conceptually connected standards. In addition, this figure illustrates two task formats: verification (§\ref{sec:entail}) and tagging (§\ref{sec:tag}).
  • Figure 2: Verification accuracy when problems are paired with aligned standards (+) or with unaligned standards, ordered from left to right in increasing similarity to the positive standard ($\mathcal{D'}\mathcal{G'} \rightarrow \mathcal{D'}\mathcal{G} \rightarrow \mathcal{D}\mathcal{G'} \rightarrow \mathcal{D}\mathcal{G} \rightarrow \mathcal{N}$). Language models have difficulty performing verification as standards become increasingly similar.
  • Figure 3: Average per-branch accuracy at each level ($\mathcal{D}$, $\mathcal{C}$, $\mathcal{S}$) of the tagging tree during assisted traversal. The dashed line indicates a random baseline accuracy of 0.5. Stronger models decrease in performance when asked to make more granular decisions.
  • Figure 4: Models' problem solving performance, based on the grade levels of $\mathcal{S}$ tagged in problems.
  • Figure 5: Verification accuracy when problems are paired with aligned standards (+) or with unaligned standards, ordered from left to right in increasing similarity to the positive standard ($\mathcal{D'}\mathcal{G'} \rightarrow \mathcal{D'}\mathcal{G} \rightarrow \mathcal{D}\mathcal{G'} \rightarrow \mathcal{D}\mathcal{G} \rightarrow \mathcal{N}$). Language models have difficulty performing verification as standards become increasingly similar.
  • ...and 18 more figures
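
The two task formats named in the abstract and in Figure 1, verification and tagging, can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the `Problem` class, the placeholder standard IDs (`STD-A`, `STD-B`), the toy predictor, and the choice of plain accuracy and exact set match as scores. It is not the authors' data format or evaluation code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Problem:
    """A math problem plus the IDs of the standards it is labeled with."""
    text: str
    aligned: frozenset  # hypothetical placeholder IDs, not real ATC identifiers


def verification_accuracy(pairs, predict):
    """Verification: for each (problem, standard) pair, the predictor answers
    yes/no on alignment; return the fraction of answers matching the labels."""
    correct = 0
    for problem, standard_id in pairs:
        gold = standard_id in problem.aligned
        if predict(problem, standard_id) == gold:
            correct += 1
    return correct / len(pairs)


def tagging_exact_match(problem, predicted_ids):
    """Tagging: the predictor proposes a set of standards; here a prediction
    counts as correct only if it equals the labeled set (an assumed metric)."""
    return set(predicted_ids) == set(problem.aligned)


# Toy data with placeholder IDs.
p = Problem(text="Add 1/4 and 2/4.", aligned=frozenset({"STD-A"}))
pairs = [(p, "STD-A"), (p, "STD-B")]  # one aligned pair, one unaligned pair


def always_yes(problem, standard_id):
    """Trivial stand-in for a model that claims every standard aligns."""
    return True


print(verification_accuracy(pairs, always_yes))   # 0.5
print(tagging_exact_match(p, {"STD-A"}))           # True
print(tagging_exact_match(p, {"STD-A", "STD-B"}))  # False
```

The always-yes predictor scores 0.5 on this balanced pair set, which matches the random baseline mentioned in the Figure 3 caption for binary branch decisions. Under exact match, a tagging prediction that includes one extra, closely related standard counts as wrong, which is one way the "close to ground truth, but differ in subtle ways" failure mode would show up in a score.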