Think$^{2}$: Grounded Metacognitive Reasoning in Large Language Models

Abraham Paul Elenjical, Vivek Hruday Kavuri, Vasudeva Varma

TL;DR

This work presents a cognitively principled framework for metacognition in large language models. It maps Brown's Planning–Monitoring–Evaluation cycle onto a prompting architecture and pairs it with a lightweight dual-process MetaController for adaptive effort allocation. On two open-weight models, Llama-3-8B and Qwen-3-8B, evaluated across diverse benchmarks targeting higher-order cognition, explicitly structured regulation improves error diagnosis and self-correction, and human evaluators strongly prefer the resulting reasoning traces for trustworthiness and self-awareness. The effects are model-dependent: native reasoning models benefit most from the framework, whereas uniformly enforcing regulation can hinder weaker, instruction-tuned models, which motivates adaptive routing that balances cost and benefit. The findings argue for psychologically grounded evaluation protocols and datasets that stress diagnostic and regulatory capabilities, offering a principled path toward more transparent and robust AI systems.

Abstract

Large Language Models (LLMs) demonstrate strong reasoning performance, yet their ability to reliably monitor, diagnose, and correct their own errors remains limited. We introduce a psychologically grounded metacognitive framework that operationalizes Ann Brown's regulatory cycle (Planning, Monitoring, and Evaluation) as a structured prompting architecture, and study its integration within a lightweight dual-process MetaController for adaptive effort allocation. Across diverse reasoning and diagnostic benchmarks (GSM8K, CRUXEval, MBPP, AIME, CorrectBench, and TruthfulQA) using Llama-3 and Qwen-3 (8B), explicit regulatory structuring substantially improves error diagnosis and yields a threefold increase in successful self-correction. Blinded human evaluations over 580 query pairs show an 84% aggregate preference for trustworthiness and metacognitive self-awareness over standard and Chain-of-Thought baselines. Grounding LLM reasoning in established cognitive theory offers a principled path toward more transparent and diagnostically robust AI systems.
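
To make the two components named in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes the Planning, Monitoring, and Evaluation stages can be expressed as numbered instructions in a single prompt, and that the MetaController routes on a scalar difficulty estimate compared against a threshold. The stage wording, the names `build_regulated_prompt`, `MetaController`, and `toy_difficulty`, and the threshold value are all hypothetical; this page does not give the actual prompts or routing criterion.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stage instructions; the paper's real prompt wording is not shown here.
STAGES = (
    ("Planning", "State the goal, the known facts, and a step-by-step plan."),
    ("Monitoring", "Carry out the plan, checking each step for errors as you go."),
    ("Evaluation", "Review the final answer against the goal and revise if a check fails."),
)


def build_regulated_prompt(query: str) -> str:
    """Wrap a query in a Planning -> Monitoring -> Evaluation scaffold."""
    lines = [f"Question: {query}", ""]
    for index, (name, instruction) in enumerate(STAGES, start=1):
        lines.append(f"{index}. {name}: {instruction}")
    return "\n".join(lines)


@dataclass
class MetaController:
    """Routes a query to direct generation (System 1) or regulated reasoning (System 2)."""

    generate: Callable[[str], str]               # any text-in, text-out model call
    estimate_difficulty: Callable[[str], float]  # assumed to return a score in [0, 1]
    threshold: float = 0.5                       # assumed cut-off, not from the paper

    def route(self, query: str) -> str:
        hard = self.estimate_difficulty(query) >= self.threshold
        return "system2" if hard else "system1"

    def answer(self, query: str) -> str:
        if self.route(query) == "system2":
            return self.generate(build_regulated_prompt(query))
        return self.generate(query)


def toy_difficulty(query: str) -> float:
    """Placeholder heuristic: longer queries count as harder."""
    return min(len(query.split()) / 20, 1.0)


# Stub model that echoes its prompt, so the routing can be inspected offline.
controller = MetaController(generate=lambda prompt: prompt,
                            estimate_difficulty=toy_difficulty)
short_query = "What is 2 + 2?"
long_query = ("A train leaves at noon travelling 60 km per hour; "
              "when does it arrive 150 km away?")
print(controller.route(short_query))
print(controller.route(long_query))
```

With the echo stub, calling `controller.answer(long_query)` returns the scaffolded prompt itself, which makes it easy to check what a real model would receive. Routing on a threshold reflects the model-dependent effects noted in the TL;DR: regulation is applied only where the difficulty estimate suggests it is worth the extra cost.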

Paper Structure (72 sections, 4 figures, 4 tables)

Figures (4)

  • Figure 1: Overview of the proposed meta-controller architecture that routes user queries between direct generation (System 1) and deliberate reasoning (System 2) to produce the final answer.
  • Figure 2: The blinded interface used for human evaluation. Annotators were presented with the original prompt and two anonymized reasoning traces, evaluating them first individually for error diagnosis, and then comparatively for trustworthiness.
  • Figure 3: Human evaluation win rates (excluding ties) showing Ann Brown prompting significantly outperforming all baselines across trustworthiness, self-awareness, and real-world preference.
  • Figure 4: Comparison of strict error rates indicating lower overall failure rates for Ann Brown prompting relative to baseline strategies (lower is better).