Metacognitive Capabilities of LLMs: An Exploration in Mathematical Problem Solving

Aniket Didolkar, Anirudh Goyal, Nan Rosemary Ke, Siyuan Guo, Michal Valko, Timothy Lillicrap, Danilo Rezende, Yoshua Bengio, Michael Mozer, Sanjeev Arora

TL;DR

The paper investigates whether large language models possess metacognitive knowledge about their own reasoning by inducing a catalog of math-solving skills. It constructs a Skill Exemplar Repository in two stages: a strong LLM first assigns fine-grained skill labels, then semantically clusters them into coarse, human-interpretable skills. Exemplars from the repository are then used to guide in-context solving. Empirical results on GSM8K and MATH show substantial gains over Chain-of-Thought and other baselines, including an 11.6% improvement on MATH and a 7.52% improvement when combined with program-aided prompting, with demonstrated transfer to weaker LLMs and to other datasets. The findings suggest that metacognitive skill knowledge can bootstrap improved reasoning and generalize across models and domains, offering a pathway toward extending LLM capabilities beyond math.

Abstract

Metacognitive knowledge refers to humans' intuitive knowledge of their own thinking and reasoning processes. Today's best LLMs clearly possess some reasoning processes. The paper gives evidence that they also have metacognitive knowledge, including the ability to name the skills and procedures to apply to a given task. We explore this primarily in the context of math reasoning, developing a prompt-guided interaction procedure that gets a powerful LLM to assign sensible skill labels to math questions and then to perform semantic clustering to obtain coarser families of skill labels. These coarse skill labels look interpretable to humans. To validate that these skill labels are meaningful and relevant to the LLM's reasoning processes, we perform the following experiments. (a) We ask GPT-4 to assign skill labels to training questions in the math datasets GSM8K and MATH. (b) When using an LLM to solve the test questions, we present it with the full list of skill labels and ask it to identify the skill needed. It is then presented with randomly selected exemplar solved questions associated with that skill label. This improves accuracy on GSM8K and MATH for several strong LLMs, including code-assisted models. The methodology presented is domain-agnostic, even though this article applies it to math problems.
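
Step (b) of the abstract can be illustrated with a minimal sketch. It assumes the repository is a plain mapping from skill name to solved examples and that the LLM's skill choice is available as a callable. The names `build_skill_prompt`, `label_skill`, `toy_repository`, and `toy_labeler` are hypothetical, as are the prompt wording and the number of exemplars `k`; none of them come from the paper.

```python
import random
from typing import Callable, Dict, List, Tuple

Exemplar = Tuple[str, str]  # (question, worked solution)


def build_skill_prompt(
    question: str,
    repository: Dict[str, List[Exemplar]],
    label_skill: Callable[[str, List[str]], str],
    k: int = 2,
    seed: int = 0,
) -> str:
    """Pick a skill for `question`, then prepend up to k random exemplars of it."""
    skills = sorted(repository)
    # In the paper this choice is made by the LLM itself; here it is any callable.
    skill = label_skill(question, skills)
    if skill not in repository:
        raise ValueError(f"unknown skill label: {skill!r}")
    pool = repository[skill]
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    shots = rng.sample(pool, min(k, len(pool)))
    parts = [f"Question: {q}\nSolution: {a}" for q, a in shots]
    parts.append(f"Question: {question}\nSolution:")
    return "\n\n".join(parts)


# Toy stand-ins (hypothetical data and labeler, for illustration only).
toy_repository: Dict[str, List[Exemplar]] = {
    "counting": [("How many ways to order 3 books?", "3! = 6")],
    "percentages": [
        ("What is 10% of 50?", "0.10 * 50 = 5"),
        ("What is 25% of 80?", "0.25 * 80 = 20"),
    ],
}


def toy_labeler(question: str, skills: List[str]) -> str:
    # A keyword rule standing in for the LLM's skill identification.
    return "percentages" if "%" in question else "counting"


prompt = build_skill_prompt("What is 20% of 30?", toy_repository, toy_labeler, k=2)
print(prompt)
```

In a real setup `label_skill` would wrap a call to the solving LLM, and the returned prompt would be sent to that same model.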

Paper Structure

This paper contains 37 sections, 5 figures, 15 tables.

Figures (5)

  • Figure 1: Creating the Skill Exemplar Repository. First, a powerful LLM A labels each question with a corresponding skill, using the prompt shown in Figure 2 (left). Next, LLM A is asked to combine similar fine-grained skills into broader skill clusters, which represent complex skills. This greatly reduces the number of unique skills from the first stage. The prompt for this step is shown in Figure 2 (middle). LLM A is then asked to reclassify all examples from the training set into one of the post-clustering skills. From these we create a 'Skill Exemplar Repository', in which each entry pairs a skill name with its corresponding question/answer examples. Inference: Given a test question, we ask LLM B (which may or may not be the same as A) to label the question with one of the skills from the Skill Exemplar Repository. We then fetch exemplars with the same skill label from the repository and provide them to LLM B as in-context examples to help it solve the test question.
  • Figure 2: Prompts for Skill Labelling and Clustering. (Left) The prompt used to label all examples in the training set $\mathcal{T}$ with skills. (Middle) The prompt used to cluster the skills obtained after skill labelling. (Right) The prompt used to label each test set example with skills.
  • Figure 3: Skill-Wise Plot. This figure compares the per-skill accuracies of the Skill-Based approach and the Random approach on the GSM8K dataset. The proposed Skill-Based approach achieves better accuracy on 11 of the 18 skills.
  • Figure 4: Ablation Metrics. This figure compares the Skill Success Rate, Secondary Skill Success Rate, and Calculation Success Rate of the Topic-Based and Skill-Based approaches. We expect the proposed Skill-Based approach to be useful mainly in picking the correct skills. This is indeed the case: it achieves a higher Skill Success Rate than the Topic-Based approach. Moreover, the proposed approach also results in fewer calculation and secondary skill errors.
  • Figure 5: Ablation Prompt. This figure shows the prompt given to GPT-4 to categorize each example from the MATH dataset as a Skill Error, Secondary Skill Error, or Calculation Error.
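
The three repository-building stages described in the Figure 1 caption (fine-grained labelling, clustering, reclassification) can be sketched as follows. This is an illustrative outline under stated assumptions, not the authors' implementation: each LLM call is replaced by a caller-supplied function, and `build_repository`, `label_fine`, `cluster`, `relabel`, and the toy data are all hypothetical names introduced here.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

Example = Tuple[str, str]  # (question, worked solution)


def build_repository(
    train: Iterable[Example],
    label_fine: Callable[[str], str],
    cluster: Callable[[List[str]], List[str]],
    relabel: Callable[[str, List[str]], str],
) -> Dict[str, List[Example]]:
    """Build a skill -> examples mapping in three stages."""
    train = list(train)
    # Stage 1: assign one fine-grained skill to every training question.
    fine_skills = sorted({label_fine(q) for q, _ in train})
    # Stage 2: merge similar fine-grained skills into coarser skills.
    coarse_skills = cluster(fine_skills)
    # Stage 3: reassign every training example to one coarse skill.
    repo: Dict[str, List[Example]] = defaultdict(list)
    for q, a in train:
        skill = relabel(q, coarse_skills)
        if skill not in coarse_skills:
            raise ValueError(f"relabel returned unknown skill: {skill!r}")
        repo[skill].append((q, a))
    return dict(repo)


# Toy stand-ins for the three LLM calls (hypothetical rules and data).
def toy_fine(question: str) -> str:
    if "/" in question and "+" in question:
        return "fraction_addition"
    if "/" in question and "*" in question:
        return "fraction_multiplication"
    return "counting_arrangements"


def toy_cluster(fine: List[str]) -> List[str]:
    return sorted(
        {"fraction_arithmetic" if s.startswith("fraction") else "counting" for s in fine}
    )


def toy_relabel(question: str, coarse: List[str]) -> str:
    return "fraction_arithmetic" if "/" in question else "counting"


train_set = [
    ("1/2 + 1/3 = ?", "5/6"),
    ("1/2 * 1/3 = ?", "1/6"),
    ("How many ways to order 3 books?", "3! = 6"),
]
repo = build_repository(train_set, toy_fine, toy_cluster, toy_relabel)
print(sorted(repo))
```

Keeping the three stages as separate callables mirrors the caption's separation of prompts, so each stand-in could be swapped for a real LLM call independently.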