MATHWELL: Generating Educational Math Word Problems Using Teacher Annotations

Bryan R Christ; Jonathan Kropko; Thomas Hartvigsen

TL;DR

MATHWELL is the first K-8 word problem generator targeted at educational appropriateness. Expert studies find that it generates problems far more solvable, accurate, and appropriate than public models.

Abstract

Math word problems are critical K-8 educational tools, but writing them is time-consuming and requires extensive expertise. To be educational, problems must be solvable, have accurate answers, and, most importantly, be educationally appropriate. We propose that language models have the potential to support K-8 math education by automatically generating word problems. However, educational appropriateness is hard to quantify. We fill this gap by having teachers evaluate problems generated by LLMs; the teachers find that existing models and data often fail to be educationally appropriate. We then explore automatically generating educational word problems, ultimately using our expert annotations to finetune a 70B language model. Our model, MATHWELL, is the first K-8 word problem generator targeted at educational appropriateness. Further expert studies find that MATHWELL generates problems far more solvable, accurate, and appropriate than public models. MATHWELL also matches GPT-4's problem quality while attaining more appropriate reading levels for K-8 students and avoiding the generation of harmful questions.

Paper Structure (129 sections, 10 figures, 13 tables)

Introduction
Related Work
Math QA Datasets
MWP Generation
Methods
Human Evaluation Criteria
Automatic Evaluation Criteria
Evaluating Existing Datasets
Expert Annotation
Further Finetuning on High-quality Outputs
EGSM Dataset Characteristics
Evaluating MATHWELL
Human Evaluation
MATHWELL Matches SOTA in Human Evaluation Criteria
MATHWELL Generates High-quality, Complex Questions
...and 114 more sections

Figures (10)

Figure 1: Generating educational math word problems with language models. To be educational, problems must simultaneously be solvable, accurate, and educationally appropriate.
Figure 2: Llama-2 (70B) performance with 95% confidence intervals on our human evaluation metrics under different prompting/training scenarios. FT is supervised finetuning.
Figure 3: MATHWELL training and EGSM generation process. SFT is supervised finetuning and MaC denotes outputs that meet all criteria.
Figure 4: Flesch-Kincaid grade level (FKGL) distribution of model MWPs. Dotted lines show mean FKGL.
Figure 5: Solvability directions.
...and 5 more figures
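
The Flesch-Kincaid grade level (FKGL) named in the Figure 4 caption is a standard readability score computed from word, sentence, and syllable counts. The following is a minimal sketch of that formula, not the paper's evaluation code: the function names are hypothetical, and the syllable counter is a rough vowel-group heuristic assumed here for illustration, so its scores will differ from those of a dictionary-based tool.

```python
import re


def fkgl_from_counts(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid grade level from raw counts (standard published formula)."""
    if words <= 0 or sentences <= 0:
        raise ValueError("need at least one word and one sentence")
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def count_syllables(word: str) -> int:
    """Assumed heuristic: count runs of vowels, with a minimum of one syllable."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def fkgl(text: str) -> float:
    """Approximate FKGL of a text using the heuristic syllable counter above."""
    # Sentences are split on ., !, or ?; digits are not counted as words.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return fkgl_from_counts(len(words), len(sentences), syllables)
```

For example, `fkgl("Sam has 3 apples. He eats one.")` scores a short word problem; lower values indicate text readable at earlier grade levels.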
