TreeCut: A Synthetic Unanswerable Math Word Problem Dataset for LLM Hallucination Evaluation

Jialin Ouyang

TL;DR

TreeCut addresses the challenge of reliably evaluating mathematical reasoning in large language models by introducing a synthetic, tree-based generation framework that can produce an unlimited number of unanswerable math word problems. By removing a chosen necessary condition along the path to the queried variable, TreeCut yields controllable unanswerable instances (tunable by $\texttt{numVars}$, $\texttt{ansDepth}$, $\texttt{cutDepth}$, and $\texttt{compositeName}$) together with their answerable counterparts, enabling precise analysis of when LLMs hallucinate. Empirical results show substantial zero-shot hallucination rates for GPT-4o and o3-mini (64% and 44% in their respective worst-case scenarios). Deeper or more complex trees, composite item names, and mid-path cuts all increase vulnerability, while few-shot prompts moderately mitigate hallucinations for some models. The work provides generation code and sample data to support targeted investigations into LLM reasoning, unanswerability detection, and prompting strategies, with potential implications for safer deployment in mathematical reasoning tasks.

Abstract

Large language models (LLMs) now achieve near-human performance on standard math word problem benchmarks (e.g., GSM8K), yet their true reasoning ability remains disputed. A key concern is that models often produce confident, yet unfounded, answers to unanswerable problems. We introduce TreeCut, a synthetic dataset that systematically generates an unlimited number of unanswerable math word problems and their answerable counterparts by representing each question as a tree and removing chosen necessary conditions. Experiments show that TreeCut effectively induces hallucinations in large language models, including GPT-4o and o3-mini, with rates of 64% and 44% in their respective worst-case scenarios under the zero-shot setting. Further analysis highlights that deeper or more complex trees, composite item names, and removing a necessary condition near the middle of a path all increase the likelihood of hallucinations, underscoring the persistent challenges LLMs face in identifying unanswerable math problems. The dataset generation code and sample data are available at https://github.com/j-bagel/treecut-math.
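
The tree-and-cut idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' generator (see the repository above for that). The function names, the parent-map representation, and the convention that `cut_depth` counts edges upward from the queried variable are all assumptions made for illustration. Node 0 stands for a root whose value is known, each edge stands for one stated condition, and a variable is answerable only if every condition on its path to the root is present.

```python
import random

ROOT = 0  # hypothetical root node: the anchor from which values are given


def build_tree(num_vars, ans_depth, rng):
    """Return a parent map; parent[v] is the node that v's condition refers to.

    Variables 1..ans_depth form the path ROOT -> x1 -> ... -> x_ans_depth,
    so the queried variable x_ans_depth sits at depth ans_depth.
    The remaining variables attach to random existing nodes.
    """
    parent = {v: v - 1 for v in range(1, ans_depth + 1)}
    for v in range(ans_depth + 1, num_vars + 1):
        parent[v] = rng.randrange(0, v)  # any node created so far
    return parent


def cut(parent, query, cut_depth):
    """Remove one necessary condition on the path from ROOT to `query`.

    Assumed convention: cut_depth = 1 removes the condition that defines
    `query` itself; larger values move the cut toward the root.
    Requires 1 <= cut_depth <= depth of `query`.
    """
    node = query
    for _ in range(cut_depth - 1):
        node = parent[node]
    pruned = dict(parent)
    del pruned[node]  # drop the edge (condition) above `node`
    return pruned


def is_answerable(parent, query):
    """A variable is answerable iff its chain of conditions reaches ROOT."""
    node = query
    while node != ROOT:
        if node not in parent:
            return False
        node = parent[node]
    return True


rng = random.Random(0)
tree = build_tree(num_vars=6, ans_depth=4, rng=rng)
print(is_answerable(tree, 4))             # answerable counterpart
print(is_answerable(cut(tree, 4, 2), 4))  # unanswerable after the cut
```

In this sketch every cut on the root-to-query path makes the question unanswerable, so each answerable tree pairs naturally with an unanswerable variant at each cut depth.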

Paper Structure

This paper contains 24 sections, 3 figures, 3 tables, and 2 algorithms.

Figures (3)

  • Figure 1: The left and middle panels depict the tree structures corresponding to the answerable and unanswerable questions, respectively. In the right panel, the strike-through sentence represents the formula removed by the cut. The variable mappings to items are as follows: $x_1$ represents a burger, $x_2$ represents a scrambled egg, $x_3$ represents a BLT sandwich, and $x_4$ represents a pie.
  • Figure 2: Hallucination percentage under different configurations of unanswerable problems, plotted against varying $\texttt{ansDepth}$.
  • Figure 3: Hallucination percentage versus $\texttt{cutDepth}$. The left panel has $\texttt{ansDepth} = 7$. The right panel has $\texttt{ansDepth} = 8$.