Let's Verify Math Questions Step by Step

Chengyu Shen, Zhen Hao Wong, Runming He, Hao Liang, Meiyi Qiang, Zimo Meng, Zhengyang Zhao, Bohan Zeng, Zhengzhou Zhu, Bin Cui, Wentao Zhang

TL;DR

MathQ-Verify addresses ill-posed math questions in QA datasets with a five-stage pipeline that decomposes each question into atomic conditions $P_i$ and goals $G_j$, then applies contamination, linguistic, atomic-condition, consistency, and completeness checks. A multi-model voting strategy adds robustness, and the pipeline is accompanied by ValiMath, a step-labeled benchmark of 2,147 questions derived from NuminaMath that enables fine-grained evaluation. The approach achieves state-of-the-art performance on MathClean and MVQ-2K, improving F1 by up to 25 percentage points over the direct verification baseline and reaching approximately 90% precision and 63% recall with voting. Overall, MathQ-Verify offers a scalable way to build reliable mathematical QA datasets by reducing label noise and avoiding unnecessary computation on invalid questions; code and data are publicly available.
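
For orientation, the reported precision and recall can be combined into a single F1 value using the standard harmonic-mean definition. This is a back-of-the-envelope calculation from the rounded figures quoted above, so it may not match the F1 reported in the paper's tables exactly.

$$
F_1 = \frac{2PR}{P + R} = \frac{2 \times 0.90 \times 0.63}{0.90 + 0.63} = \frac{1.134}{1.53} \approx 0.74
$$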

Abstract

Large Language Models (LLMs) have recently achieved remarkable progress in mathematical reasoning. To enable such capabilities, many existing works distill strong reasoning models into long chains of thought or design algorithms to construct high-quality math QA data for training. However, these efforts primarily focus on generating correct reasoning paths and answers, while largely overlooking the validity of the questions themselves. In this work, we propose Math Question Verification (MathQ-Verify), a novel five-stage pipeline designed to rigorously filter ill-posed or under-specified math problems. MathQ-Verify first performs format-level validation to remove redundant instructions and ensure that each question is syntactically well-formed. It then formalizes each question, decomposes it into atomic conditions, and verifies them against mathematical definitions. Next, it detects logical contradictions among these conditions, followed by a goal-oriented completeness check to ensure the question provides sufficient information for solving. To evaluate this task, we use existing benchmarks along with an additional dataset we construct, containing 2,147 math questions with diverse error types, each manually double-validated. Experiments show that MathQ-Verify achieves state-of-the-art performance across multiple benchmarks, improving the F1 score by up to 25 percentage points over the direct verification baseline. It further attains approximately 90% precision and 63% recall through a lightweight model voting scheme. MathQ-Verify offers a scalable and accurate solution for curating reliable mathematical datasets, reducing label noise and avoiding unnecessary computation on invalid questions. Our code and data are available at https://github.com/scuuy/MathQ-Verify.
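
The staged filtering described above can be sketched as a short-circuiting sequence of checks followed by a vote across verifiers. The sketch below is illustrative only and is not the authors' implementation: the names `Question`, `verify`, `vote`, and `TOY_STAGES` are hypothetical, the string-matching checks are toy placeholders for what would be model-based judgments, and the threshold-style vote is an assumption, since the abstract does not specify the voting rule.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Question:
    """A math question with its extracted structure (hypothetical container)."""
    text: str
    conditions: List[str] = field(default_factory=list)  # atomic conditions P_i
    goals: List[str] = field(default_factory=list)       # goals G_j


# A check returns True when the question passes that stage.
Check = Callable[[Question], bool]


def verify(question: Question, stages: List[Tuple[str, Check]]) -> Optional[str]:
    """Run the stages in order; return the first failing stage name, or None."""
    for name, check in stages:
        if not check(question):
            return name
    return None


def vote(verdicts: List[bool], min_agree: int) -> bool:
    """Keep a question only if at least `min_agree` verifiers judge it valid.

    Assumption: a simple threshold vote; the actual scheme may differ.
    """
    return sum(verdicts) >= min_agree


# Toy placeholder checks, one per stage. A real system would replace each
# lambda with a model-based judgment.
TOY_STAGES: List[Tuple[str, Check]] = [
    ("contamination", lambda q: "ignore previous" not in q.text.lower()),
    ("linguistic", lambda q: bool(q.text.strip())),
    ("atomic_condition", lambda q: all("area = -" not in c for c in q.conditions)),
    ("consistency", lambda q: all(("not " + c) not in q.conditions for c in q.conditions)),
    ("completeness", lambda q: bool(q.conditions) and bool(q.goals)),
]

bad = Question(
    text="A rectangle has area -6 and width 2. Find its length.",
    conditions=["area = -6", "width = 2"],
    goals=["length"],
)
good = Question(
    text="A rectangle has area 6 and width 2. Find its length.",
    conditions=["area = 6", "width = 2"],
    goals=["length"],
)

failed_stage_bad = verify(bad, TOY_STAGES)    # fails at the atomic-condition stage
failed_stage_good = verify(good, TOY_STAGES)  # passes every stage
kept = vote([True, True, False], min_agree=2)
```

Returning the name of the first failing stage, rather than a bare boolean, keeps the rejection reason available for step-level evaluation of the kind a step-labeled benchmark supports.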

Paper Structure

This paper contains 37 sections, 8 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: An overview of our MathQ-Verify framework. Given a math question, we extract atomic conditions $P_i$ and the goal $G_j$, and conduct a five-step verification process: (1) detection of contaminated instructions, (2) linguistic error detection, (3) atomic condition verification, (4) cross-condition conflict detection, and (5) condition completeness validation. Only questions that pass all checks are retained as correct structured math questions.
  • Figure 2: An example of an incorrect atomic condition: a negative area contradicts fundamental mathematical definitions.
  • Figure 3: Distribution heatmaps of correct questions across difficulty levels and primary mathematical categories. The left heatmap shows human-annotated correct question distribution, while the right heatmap shows the distribution of questions validated as correct by MathQ-Verify after filtering.
  • Figure 4: Overview of the ValiMath construction pipeline.