Mathematics, word problems, common sense, and artificial intelligence

Ernest Davis

TL;DR

The paper analyzes the capabilities and limits of AI, particularly LLM-based approaches, in solving word problems that blend elementary math with commonsense reasoning. It reviews three approaches (direct answer output, code generation, and autoformalization) and surveys benchmarks such as SVAMP and LILA to assess progress. It shows that no current system reliably solves elementary commonsense word problems (CSWs) and that performance is highly uneven across domains, with artifacts complicating evaluation. It discusses implications for practical applications and for reading human-written mathematical content, while remaining cautious about relevance to pure mathematical research.

Abstract

The paper discusses the capacities and limitations of current artificial intelligence (AI) technology to solve word problems that combine elementary knowledge with commonsense reasoning. No existing AI systems can solve these reliably. We review three approaches that have been developed, using AI natural language technology: outputting the answer directly, outputting a computer program that solves the problem, and outputting a formalized representation that can be input to an automated theorem verifier. We review some benchmarks that have been developed to evaluate these systems and some experimental studies. We discuss the limitations of the existing technology at solving these kinds of problems. We argue that it is not clear whether these kinds of limitations will be important in developing AI technology for pure mathematical research, but that they will be important in applications of mathematics, and may well be important in developing programs capable of reading and understanding mathematical content written by humans.

Paper Structure (14 sections, 7 tables)