AI-Assisted Generation of Difficult Math Questions

Vedant Shah, Dingli Yu, Kaifeng Lyu, Simon Park, Jiatong Yu, Yinghui He, Nan Rosemary Ke, Michael Mozer, Yoshua Bengio, Sanjeev Arora, Anirudh Goyal

TL;DR

The paper tackles the shortage of diverse, difficult math evaluation data for large language models by introducing a human-in-the-loop pipeline that uses LLMs' metacognitive abilities to extract math skills from the MATH dataset and to generate novel questions from random pairs of those skills. This process yields MATH$^2$, a dataset that is harder for models but provides stronger in-context exemplars. Models' accuracy on MATH$^2$ is roughly the square of their accuracy on the original MATH dataset, which suggests that the new questions genuinely require composing two skills. Empirically, every evaluated model scores lower on MATH$^2$ than on MATH, yet MATH$^2$ questions improve performance on MATH when used as in-context exemplars, highlighting the value of high-quality synthetic data validated by humans. The framework is positioned as scalable and adaptable to other domains requiring structured reasoning, offering a potential path toward scalable oversight in AI systems.

Abstract

Current LLM training positions mathematical reasoning as a core capability. With publicly available sources fully tapped, there is unmet demand for diverse and challenging math questions. Relying solely on human experts is both time-consuming and costly, while LLM-generated questions often lack the requisite diversity and difficulty. We present a design framework that combines the strengths of LLMs with a human-in-the-loop approach to generate a diverse array of challenging math questions. We leverage the metacognition skills [Didolkar et al., 2024] of a strong LLM to extract core "skills" from existing math datasets. These skills serve as the basis for generating novel and difficult questions by prompting the LLM with random pairs of core skills. The use of two different skills within each question makes finding such questions an "out of distribution" task for both LLMs and humans. Our pipeline employs LLMs to iteratively generate and refine questions and solutions through multiturn prompting. Human annotators then verify and further refine the questions, with their efficiency enhanced via further LLM interactions. Applying this pipeline to skills extracted from the MATH dataset [Hendrycks et al., 2021] resulted in MATH$^2$, a dataset of higher-quality math questions, as evidenced by: (a) lower performance of all models on MATH$^2$ than on MATH, and (b) higher performance on MATH when using MATH$^2$ questions as in-context examples. Although focused on mathematics, our methodology seems applicable to other domains requiring structured reasoning, and potentially as a component of scalable oversight. Also of interest is a striking relationship between models' performance on the two datasets: the success rate on MATH$^2$ is approximately the square of the success rate on MATH, suggesting that successfully solving a question in MATH$^2$ requires a nontrivial combination of two distinct math skills.
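One way to read the squared relationship is as an independence heuristic: if a model applies a single skill correctly with probability $x$, and a two-skill question is solved only when both skills are applied correctly and independently, the expected success rate is $x \cdot x = x^2$. The Python sketch below illustrates that reading only. The function names are hypothetical, the independence assumption is ours rather than a claim from the paper, and the accuracy values passed in are illustrative inputs, not reported results.

```python
import random


def predicted_two_skill_accuracy(single_skill_accuracy: float) -> float:
    """Expected accuracy on two-skill questions under the independence assumption."""
    if not 0.0 <= single_skill_accuracy <= 1.0:
        raise ValueError("accuracy must lie in [0, 1]")
    return single_skill_accuracy ** 2


def simulate_two_skill_accuracy(
    single_skill_accuracy: float, n_questions: int = 100_000, seed: int = 0
) -> float:
    """Monte Carlo estimate: a question counts as solved only if both skills succeed."""
    rng = random.Random(seed)
    solved = 0
    for _ in range(n_questions):
        first_ok = rng.random() < single_skill_accuracy
        second_ok = rng.random() < single_skill_accuracy
        if first_ok and second_ok:
            solved += 1
    return solved / n_questions
```

For example, `predicted_two_skill_accuracy(0.5)` returns `0.25`, and the simulated estimate for the same input should land close to that value. Real models need not satisfy the independence assumption, so this is a way to interpret the trend rather than an explanation of it.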

Paper Structure

This paper contains 38 sections, 24 equations, 4 figures, 10 tables.

Figures (4)

  • Figure 1: AI-assisted question generation. The figure outlines a five-step pipeline for generating high-quality questions. (a) Skill Pair Validation – The model ensures the given skills are distinct. (b) Question Generation – The model is asked to generate a question requiring both skills. (c) Attempted Solution – The model is asked to solve the question with a defeatist approach. (d) Question Validation – The question is assessed for correctness, rigor, clarity, and related criteria. (e) Final Solution – Valid questions are re-solved using advanced techniques such as in-context prompting and majority voting.
  • Figure 2: Comparison of zero-shot performance of various models on MATH and the new dataset MATH$^2$. The figure shows the zero-shot chain-of-thought (CoT) performance of both open-source and proprietary models on two datasets: MATH and MATH$^2$, our generated dataset. Every model scores lower on MATH$^2$ than on MATH. Detailed numerical values for this comparison are available in Table \ref{tab:main}.
  • Figure 3: Relation between the performance of models on MATH$^2$ ($Y$) vs the square of their performances on MATH ($X^2$). As can be seen from the plot, $Y \approx X^2$. DeepSeek-R1-Distill-Llama-8B shows the largest positive deviation from the trend, whereas Claude-3.5 Sonnet shows the largest negative deviation.
  • Figure 4: Distribution of the skills extracted during the skill extraction process across the generated set of questions. The generated and human-verified set of 210 questions covers 109 of the 114 skills extracted via the skill extraction process described in [Didolkar et al., 2024]. Each question in the generated set represents two skills. The two most frequently occurring skills are number_theory_skills and perimeter_and_area. Note that the distribution of skills is not uniform: multiple skills are represented by only one question.
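
The five steps in the Figure 1 caption can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `LLM` is a hypothetical prompt-in, text-out callable, every prompt string is a placeholder rather than a prompt from the paper, and the in-context exemplars mentioned for step (e) are omitted, leaving only majority voting.

```python
import random
from collections import Counter
from itertools import combinations
from typing import Callable, Optional

LLM = Callable[[str], str]  # hypothetical interface: prompt in, text out


def sample_skill_pair(skills: list[str], rng: random.Random) -> tuple[str, str]:
    """Pick one unordered pair of distinct skills uniformly at random."""
    return rng.choice(list(combinations(sorted(set(skills)), 2)))


def majority_vote(answers: list[str]) -> str:
    """Return the most common answer; ties resolve to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


def generate_question(
    llm: LLM, skill_a: str, skill_b: str, n_votes: int = 5
) -> Optional[dict]:
    """Run steps (a)-(e); return None when a validation step rejects the item."""
    # (a) Skill pair validation: reject pairs the model judges not distinct.
    verdict = llm(f"Are the skills '{skill_a}' and '{skill_b}' distinct? Answer yes or no.")
    if not verdict.strip().lower().startswith("yes"):
        return None
    # (b) Question generation: ask for a question that needs both skills.
    question = llm(f"Write a hard math question requiring both '{skill_a}' and '{skill_b}'.")
    # (c) Attempted solution: solve while actively looking for flaws.
    attempt = llm(f"Try to solve this question and report any flaw in it:\n{question}")
    # (d) Question validation: check correctness, rigor, and clarity.
    check = llm(
        f"Question:\n{question}\nAttempted solution:\n{attempt}\n"
        "Is the question correct, rigorous, and clear? Answer yes or no."
    )
    if not check.strip().lower().startswith("yes"):
        return None
    # (e) Final solution: re-solve several times and keep the majority answer.
    answers = [llm(f"Solve and give only the final answer:\n{question}") for _ in range(n_votes)]
    return {"skills": (skill_a, skill_b), "question": question, "answer": majority_vote(answers)}
```

In the pipeline described in the abstract, items that survive these automated steps still go to human annotators for verification and refinement, so the returned dictionary is a candidate rather than a finished dataset entry.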