MDToC: Metacognitive Dynamic Tree of Concepts for Boosting Mathematical Problem-Solving of Large Language Models

Tung Duong Ta, Tim Oates, Thien Van Luong, Huan Vu, Tien Cuong Nguyen

TL;DR

MDToC introduces a metacognitive dynamic tree of concepts to boost mathematical problem-solving in LLMs by planning diverse concepts, monitoring calculations with evaluator/fixer loops, and reviewing results via majority voting. It constructs a depth-two concept tree, generates accuracy-verified calculations, and uses metacognitive prompts to mitigate intermediate errors. Across CHAMP, MATH, and Game-of-24 benchmarks, MDToC consistently outperforms Tree-of-Thought and Graph-of-Thought prompting across multiple backbone models, including GPT-4 variants, albeit with higher compute cost. The work highlights the promise of metacognitive calculation verification for more reliable, scalable math reasoning in LLMs and discusses practical considerations like cost and hyperparameter sensitivity.

Abstract

Despite advances in mathematical reasoning capabilities, Large Language Models (LLMs) still struggle with calculation verification when using established prompting techniques. We present MDToC (Metacognitive Dynamic Tree of Concepts), a three-phase approach that constructs a concept tree, develops accuracy-verified calculations for each concept, and employs majority voting to evaluate competing solutions. Evaluations across the CHAMP, MATH, and Game-of-24 benchmarks demonstrate MDToC's effectiveness, with GPT-4-Turbo achieving 58.1% on CHAMP, 86.6% on MATH, and 85% on Game-of-24, outperforming GoT by 5%, 5.4%, and 4%, respectively, without hand-engineered hints. MDToC consistently surpasses existing prompting methods across all backbone models, yielding improvements of up to 7.6% over ToT and 6.2% over GoT, establishing metacognitive calculation verification as a promising direction for enhanced mathematical reasoning.
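
The majority-voting step mentioned in the abstract can be illustrated with a minimal Python sketch. This is an assumption-laden illustration, not the authors' implementation: the function name `majority_vote`, the string representation of answers, and the earliest-seen tie-breaking rule are all hypothetical choices made here for clarity.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_vote(answers: Sequence[Optional[str]]) -> Optional[str]:
    """Return the most frequent non-empty answer (hypothetical helper).

    Ties are broken in favour of the answer that appears first, which is
    an assumption of this sketch and not something the paper specifies.
    """
    # Drop missing or blank answers and normalise surrounding whitespace.
    valid = [a.strip() for a in answers if a and a.strip()]
    if not valid:
        return None
    counts = Counter(valid)
    best = max(counts.values())
    # Walk the answers in their original order so ties resolve deterministically.
    for answer in valid:
        if counts[answer] == best:
            return answer
    return None


if __name__ == "__main__":
    # Each entry stands for the final answer produced by one concept branch.
    print(majority_vote(["41", "34", "41"]))
```

In this sketch each list element stands for the final answer reached along one branch of the concept tree; how MDToC actually extracts and compares those answers is determined by its review-phase prompt.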

Paper Structure

This paper contains 26 sections, 1 equation, 8 figures, 8 tables.

Figures (8)

  • Figure 1: ToT prompting yields initial abstract thoughts (e.g., analyses, concepts, calculations; in red), which are challenging to evaluate due to the intangible nature of conceptual reasoning and the lack of specific criteria to measure their correctness or completeness. Our MDToC addresses these abstract thoughts by first generating concrete concepts and then producing relevant calculations for those concepts. We evaluate only the correctness of the calculations, through mathematical accuracy checks, which makes evaluation precise and thus improves problem-solving reliability.
  • Figure 2: Proposed MDToC prompting structure. $C$ represents the first-depth concept, while $SC$ represents the second-depth sub-concept. $P_0$ and $P_1$ are the prompts used in the planning phase, shown in Figure 3, while $P_2$, $P_3$, $P_4$, and $P_5$ are the prompts used in the monitoring phase, given in Figure 4. $P_6$ is the prompt used in the review phase, shown in the review-phase prompt figure.
  • Figure 3: Prompts for the planning phase.
  • Figure 4: Prompts for the monitoring phase.
  • Figure 5: Comparative analysis of reasoning steps: GPT-3.5-Turbo with CoT and CH versus GPT-3.5-Turbo with our MDToC approach. Subfigure (a) displays GPT-3.5-Turbo's reasoning with CoT, supplemented by annotated concepts and hints (CH) intended to guide the model's step-by-step reasoning; IT denotes intermediate thoughts, and FA indicates the final answer. Although these conceptual hints attempt to structure the problem-solving process, GPT-3.5-Turbo still yields an incorrect count of 34. Because there is no automatic mechanism to spot and correct mistakes in intermediate steps, the model's calculation errors persist through to the final answer. Subfigure (b) shows our proposed concept tree (CT) approach under a multi-attempt evaluator–fixer framework, referred to here as MDToC. IC stands for intermediate calculations, and each is evaluated by an evaluator component. In this example, IC3 and IC4 are identified as incorrect, triggering the fixer to regenerate corrected values in IC5. This iterative refine-and-fix process avoids propagating calculation errors, ultimately yielding the correct final answer FA of 41. Notably, this process requires no extra annotated hints, only the concept tree plus repeated evaluation for up to $2$ attempts, a threshold chosen to reduce the risk of model "hallucinations" (erroneous or fabricated steps).
  • ...and 3 more figures
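
The evaluator–fixer loop described in the Figure 5 caption can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: in MDToC the evaluator and fixer are LLM prompts, whereas here they are replaced by a deterministic arithmetic check so the example runs on its own. The names `Step`, `evaluate`, `fix`, and `monitor` are hypothetical; only the cap of 2 attempts comes from the caption.

```python
import operator
from dataclasses import dataclass
from typing import Callable, List

# Supported binary operations for this toy checker (an assumption of the sketch).
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


@dataclass
class Step:
    """One intermediate calculation of the form `left op right = claimed`."""
    left: int
    op: str
    right: int
    claimed: int


def evaluate(step: Step) -> bool:
    """Stand-in evaluator: recompute the operation and compare with the claim."""
    return OPS[step.op](step.left, step.right) == step.claimed


def fix(step: Step) -> Step:
    """Stand-in fixer: regenerate the step with the recomputed value."""
    return Step(step.left, step.op, step.right, OPS[step.op](step.left, step.right))


def monitor(
    steps: List[Step],
    evaluator: Callable[[Step], bool] = evaluate,
    fixer: Callable[[Step], Step] = fix,
    max_attempts: int = 2,  # the caption mentions up to 2 attempts
) -> List[Step]:
    """Check each step; call the fixer on failures, at most max_attempts times."""
    checked = []
    for step in steps:
        attempts = 0
        while not evaluator(step) and attempts < max_attempts:
            step = fixer(step)
            attempts += 1
        checked.append(step)
    return checked


if __name__ == "__main__":
    steps = [Step(6, "*", 4, 24), Step(7, "+", 5, 13)]  # second claim is wrong
    for s in monitor(steps):
        print(s)
```

Capping the number of fixer calls means a step that cannot be repaired is passed on unchanged rather than retried indefinitely; what MDToC does with such a step is not stated in this excerpt.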