Metacognitive Prompting Improves Understanding in Large Language Models

Yuqing Wang, Yun Zhao

TL;DR

This work introduces Metacognitive Prompting (MP), a five-stage introspective prompting framework designed to deepen LLM understanding rather than merely improve chain-of-thought reasoning. By guiding models through interpretation, initial judgments, critical reevaluation, final explanations, and confidence assessment, MP yields consistent improvements over CoT variants across ten NLU datasets spanning general and domain-specific tasks, with GPT-4 achieving the strongest performance. The study provides detailed error and confidence analyses, revealing distinct overthinking and overcorrection patterns and highlighting domain-specific terminological and interpretive challenges. While promising, MP relies on manual prompt design and exhibits calibration gaps, suggesting future work on adaptive prompting and robust confidence estimation to enhance reliability and applicability. Overall, MP advances the goal of more reliable, human-like understanding in large language models and offers a practical path toward more explainable NLU systems.

Abstract

In Large Language Models (LLMs), there have been consistent advancements in task-specific performance, largely influenced by effective prompt design. Recent advancements in prompting have enhanced reasoning in logic-intensive tasks for LLMs, yet the nuanced understanding abilities of these models, crucial for processing and interpreting complex information, remain underexplored. In this study, we introduce Metacognitive Prompting (MP), a strategy inspired by human introspective reasoning processes. Using MP, LLMs undergo a systematic series of structured, self-aware evaluations, drawing on both their vast inherent knowledge and new insights. We conduct extensive experiments on four prevalent LLMs: Llama2, PaLM2, GPT-3.5, and GPT-4, across ten natural language understanding (NLU) datasets from GLUE, SuperGLUE, BLUE, and LexGLUE benchmarks. Additionally, we compare our method with chain-of-thought prompting and its advanced versions. The results show that GPT-4 consistently excels across all tasks, while other models have shown significant progress in some tasks when used in conjunction with MP. Furthermore, MP consistently outperforms existing prompting methods in both general and domain-specific NLU tasks. This study underscores the potential to amplify the understanding abilities of LLMs and highlights the benefits of mirroring human introspective reasoning in NLU tasks.
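
The five stages listed in the TL;DR can be assembled into a single prompt, as sketched below. This is an illustration only: the stage wording is paraphrased rather than taken from the paper's prompts, and `MP_STAGES`, `build_mp_prompt`, and the example question pair are hypothetical names and inputs introduced here.

```python
from typing import List

# Paraphrased stage instructions (assumption: the paper's exact wording differs).
MP_STAGES: List[str] = [
    "Clarify your understanding of the input text.",
    "Make a preliminary judgment.",
    "Critically reassess your preliminary analysis.",
    "Confirm your final decision and explain the reasoning behind it.",
    "State your confidence in the overall analysis.",
]


def build_mp_prompt(task_instruction: str, input_text: str) -> str:
    """Join the task, the input, and all five stage instructions into one prompt string."""
    lines = [task_instruction.strip(), "", input_text.strip(), ""]
    # Number the stages so the model can answer them in order.
    lines += [f"{i}. {stage}" for i, stage in enumerate(MP_STAGES, start=1)]
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical paraphrase-detection input, not an item from any dataset.
    prompt = build_mp_prompt(
        "Decide whether the two questions below are paraphrases of each other.",
        "Question 1: How do I learn to swim?\nQuestion 2: What is the best way to learn swimming?",
    )
    print(prompt)
```

Sending all stages in one string, rather than one model call per stage, keeps the sketch to a single request; whether that suits a given model or task is a design choice left to the reader.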

Paper Structure (20 sections, 5 figures, 3 tables)

Figures (5)

  • Figure 1: Alignment between human metacognitive processes and the stages of MP in LLMs.
  • Figure 2: Our proposed method, metacognitive prompting, emulates critical steps of human metacognition, consisting of five stages: 1) understanding the input text, 2) making a preliminary judgment, 3) critically evaluating this preliminary analysis, 4) reaching a final decision accompanied by an explanation of the reasoning, and 5) evaluating the confidence level in the entire process. By reflecting on human self-assessment, these stages guide the LLM, aiding in more accurate text interpretation and facilitating better judgment formation. The diagram features three columns, from left to right, representing the high-level metacognitive stages, specific metacognitive prompts fed into the LLM, and the LLM's corresponding outputs. Prompts in the middle column are collectively fed into the LLM as a single input during the experiments. The figure illustrates a sample question chosen from the Quora Question Pair (QQP) dataset in the GLUE benchmark.
  • Figure 3: Comparison of average performance for all prompting methods in both zero-shot and 5-shot learning scenarios across four LLMs. Performance metrics are averaged over all datasets, treating each dataset and metric with equal significance and assuming direct comparability. MP consistently surpasses other methods.
  • Figure 4: Two major error types with MP: overthinking (excessive analysis) and overcorrection (excessive adjustment). Example questions are from the WiC dataset.
  • Figure 5: The relationship between correctness and confidence levels under MP, averaged over all datasets and models.
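
The Figure 3 caption says scores are averaged with every dataset and metric weighted equally. A minimal sketch of one reading of that rule follows, assuming it means a flat mean over all (dataset, metric) pairs. The function name `equal_weight_average` and the numbers in the example are invented for illustration and are not results from the paper.

```python
from statistics import fmean
from typing import Dict


def equal_weight_average(scores: Dict[str, Dict[str, float]]) -> float:
    """Average every (dataset, metric) score with the same weight.

    Assumption: "equal significance" means a flat mean over all pairs, so a
    dataset that reports two metrics contributes two values.
    """
    flat = [value for metrics in scores.values() for value in metrics.values()]
    if not flat:
        raise ValueError("no scores to average")
    return fmean(flat)


if __name__ == "__main__":
    # Placeholder scores, NOT taken from the paper.
    example = {
        "dataset_a": {"accuracy": 80.0},
        "dataset_b": {"accuracy": 70.0, "f1": 60.0},
    }
    print(equal_weight_average(example))  # mean of 80.0, 70.0 and 60.0
```

Note that a flat mean like this mixes metrics on different scales (for example accuracy and F1), which is the direct-comparability assumption the caption itself flags.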