A Diverse Corpus for Evaluating and Developing English Math Word Problem Solvers

Shen-Yun Miao, Chao-Chun Liang, Keh-Yih Su

TL;DR

This work tackles the problem of reliably evaluating English math word problem solvers by introducing ASDiv, a diverse corpus of 2,305 MWPs annotated with problem type and grade level to cover a wide range of language patterns and mathematical types. It introduces a BLEU-based lexicon usage diversity (LD) metric and the derived corpus-level diversity (CLD) to quantify diversity, demonstrating that ASDiv is more diverse than existing datasets. Empirical results show current state-of-the-art solvers underperform on ASDiv (≈36% accuracy) and that lower diversity (lower CLD) can inflate apparent performance, underscoring the need for diverse benchmarks that mirror real human tests. The work argues that grade level is a useful difficulty indicator and that diversity-focused corpus construction leads to more faithful assessments of solver capabilities, with implications for robust MWP development and evaluation.

Abstract

We present ASDiv (Academia Sinica Diverse MWP Dataset), a diverse (in terms of both language patterns and problem types) English math word problem (MWP) corpus for evaluating the capability of various MWP solvers. Existing MWP corpora for studying AI progress remain limited either in language usage patterns or in problem types. We thus present a new English MWP corpus with 2,305 MWPs that cover more text patterns and most problem types taught in elementary school. Each MWP is annotated with its problem type and grade level (for indicating the level of difficulty). Furthermore, we propose a metric to measure the lexicon usage diversity of a given MWP corpus, and demonstrate that ASDiv is more diverse than existing corpora. Experiments show that our proposed corpus reflects the true capability of MWP solvers more faithfully.
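The diversity metric mentioned above can be illustrated with a small sketch. The exact formula is not given in this summary, so the code below rests on assumptions: the lexicon usage diversity (LD) of a problem is taken to be one minus its highest symmetric BLEU similarity to any other problem in the corpus, and the corpus-level diversity (CLD) is taken to be the mean LD. The `simple_bleu` function is a simplified, smoothed stand-in for a real BLEU implementation, and all function names are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU with add-one smoothing.

    A stand-in for a full BLEU implementation, not the paper's exact setup.
    """
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clipped overlap: each candidate n-gram counts at most as often
        # as it appears in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        log_precision += math.log((overlap + 1) / (total + 1)) / max_n
    # Brevity penalty for candidates shorter than the reference.
    brevity = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity * math.exp(log_precision)


def lexicon_diversity(problems):
    """Assumed LD: 1 minus the best symmetric BLEU against any other problem."""
    scores = []
    for i, p in enumerate(problems):
        best = max(
            (simple_bleu(p, q) + simple_bleu(q, p)) / 2
            for j, q in enumerate(problems)
            if j != i
        )
        scores.append(1.0 - best)
    return scores


def corpus_diversity(problems):
    """Assumed CLD: mean LD over the corpus (needs at least two problems)."""
    ld = lexicon_diversity(problems)
    return sum(ld) / len(ld)


if __name__ == "__main__":
    corpus = [
        "Tom has 3 apples and buys 2 more . How many apples does he have ?",
        "Tom has 5 apples and buys 4 more . How many apples does he have ?",
        "A train travels 60 miles in 2 hours . What is its average speed ?",
    ]
    print(lexicon_diversity(corpus))
    print(corpus_diversity(corpus))
```

Under these assumptions, a corpus built from one template with different numbers scores close to zero, while a corpus with varied wording scores higher, which matches the intuition that low-diversity corpora make problems easy to solve by pattern matching.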

Paper Structure

This paper contains 9 sections, 3 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Lexicon usage diversity of various corpora.
  • Figure 2: Distribution of PT categories (G1–G6)
  • Figure 3: Syntactic pattern diversity of various corpora
  • Figure 4: Lexicon usage diversity of various corpora: test-set versus training-set
  • Figure 5: Syntactic pattern diversity of various corpora: test-set versus training-set
  • ...and 2 more figures