Annotating Derivations: A New Evaluation Strategy and Dataset for Algebra Word Problems

Shyam Upadhyay; Ming-Wei Chang

TL;DR

The paper proposes evaluating algebra word problem solvers by their derivations, which record how an equation system was constructed from the word problem. It formalizes derivation structure and equivalence, gives an algorithm for comparing two derivations, and shows how derivation annotations can be added semi-automatically to existing datasets. It also introduces DRAW-1K, a new dataset of 1000 general algebra word problems, and releases derivation annotations for over 2300 problems in total. Experiments show that derivation accuracy is a stricter, more informative evaluation than solution-based or equation-based metrics, revealing errors those metrics miss. The work also demonstrates practical benefits for dataset fusion and outperforms existing solvers on the provided benchmarks, with implications for education-oriented AI.

Abstract

We propose a new evaluation for automatic solvers for algebra word problems, which can identify mistakes that existing evaluations overlook. Our proposal is to evaluate such solvers using derivations, which reflect how an equation system was constructed from the word problem. To accomplish this, we develop an algorithm for checking the equivalence between two derivations, and show how derivation annotations can be semi-automatically added to existing datasets. To make our experiments more comprehensive, we include the derivation annotation for DRAW-1K, a new dataset containing 1000 general algebra word problems. In our experiments, we found that the annotated derivations enable a more accurate evaluation of automatic solvers than previously used metrics. We release derivation annotations for over 2300 algebra word problems for future evaluations.
