A Diverse Corpus for Evaluating and Developing English Math Word Problem Solvers
Shen-Yun Miao, Chao-Chun Liang, Keh-Yih Su
TL;DR
This work addresses the problem of reliably evaluating English math word problem (MWP) solvers by introducing ASDiv, a diverse corpus of 2,305 MWPs annotated with problem type and grade level and covering a wide range of language patterns and problem types. It proposes a BLEU-based lexicon usage diversity (LD) metric and a derived corpus-level diversity (CLD) score to quantify diversity, and shows that ASDiv is more diverse than existing datasets. Empirical results show that current state-of-the-art solvers reach only about 36% accuracy on ASDiv, and that less diverse corpora (lower CLD) can inflate apparent performance, underscoring the need for diverse benchmarks that mirror real human tests. The work argues that grade level is a useful difficulty indicator and that diversity-focused corpus construction yields more faithful assessments of solver capabilities, with implications for robust MWP solver development and evaluation.
Abstract
We present ASDiv (Academia Sinica Diverse MWP Dataset), a diverse (in terms of both language patterns and problem types) English math word problem (MWP) corpus for evaluating the capability of various MWP solvers. Existing MWP corpora for studying AI progress remain limited either in language usage patterns or in problem types. We thus present a new English MWP corpus with 2,305 MWPs that cover more text patterns and most problem types taught in elementary school. Each MWP is annotated with its problem type and grade level (for indicating the level of difficulty). Furthermore, we propose a metric to measure the lexicon usage diversity of a given MWP corpus, and demonstrate that ASDiv is more diverse than existing corpora. Experiments show that our proposed corpus reflects the true capability of MWP solvers more faithfully.
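The text above names a BLEU-based lexicon usage diversity metric but does not define it here. As an illustration only, the following minimal sketch assumes that the LD of a problem is one minus its highest symmetric BLEU score against any other problem in the corpus, and that CLD is the mean LD over the corpus. The function names, the whitespace tokenization, and the simplified add-one-smoothed BLEU are all assumptions made for this sketch and should not be read as the authors' implementation.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU with add-one smoothing.

    A stand-in for a real BLEU implementation (assumption of this sketch).
    """
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clipped n-gram matches between candidate and reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        log_precision += math.log((overlap + 1) / (total + 1)) / max_n
    # Brevity penalty for candidates shorter than the reference.
    brevity = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity * math.exp(log_precision)


def lexicon_diversity(problems):
    """Assumed LD: 1 minus the best symmetric BLEU match with any other problem."""
    if len(problems) < 2:
        raise ValueError("need at least two problems to compare")
    scores = []
    for i, p in enumerate(problems):
        best = max(
            (simple_bleu(p, q) + simple_bleu(q, p)) / 2
            for j, q in enumerate(problems)
            if j != i
        )
        scores.append(1.0 - best)
    return scores


def corpus_diversity(problems):
    """Assumed CLD: the mean LD over all problems in the corpus."""
    ld = lexicon_diversity(problems)
    return sum(ld) / len(ld)


# Hypothetical toy corpora: one with a verbatim duplicate, one without.
repetitive = [
    "Tom has 3 apples and buys 2 more . How many apples does he have ?",
    "Tom has 3 apples and buys 2 more . How many apples does he have ?",
    "A train travels 60 miles in 2 hours . What is its average speed ?",
]
varied = [
    "Tom has 3 apples and buys 2 more . How many apples does he have ?",
    "A train travels 60 miles in 2 hours . What is its average speed ?",
    "Each box holds 12 pencils . How many pencils are in 5 boxes ?",
]
```

Under these assumptions, a problem with a verbatim duplicate elsewhere in the corpus gets an LD of 0, so `corpus_diversity(repetitive)` comes out lower than `corpus_diversity(varied)`. This is the sense in which a lower CLD signals a corpus that a solver could score well on by matching surface patterns.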
