
CoT2-Meta: Budgeted Metacognitive Control for Test-Time Reasoning

Siyuan Ma, Bo Gao, Zikai Xiao, Hailong Wang, Xinlei Yu, Rui Qian, Jiayu Qian, Luqi Gong, Yang Liu

Abstract

Recent test-time reasoning methods improve performance by generating more candidate chains or searching over larger reasoning trees, but they typically lack explicit control over when to expand, what to prune, how to repair, and when to abstain. We introduce CoT2-Meta, a training-free metacognitive reasoning framework that combines object-level chain-of-thought generation with meta-level control over partial reasoning trajectories. The framework integrates four components: strategy-conditioned thought generation, tree-structured search, an online process oracle for step-level reasoning evaluation, and a meta-controller that allocates computation through expansion, pruning, repair, stopping, and fallback decisions. Under matched inference budgets, CoT2-Meta consistently outperforms strong single-path, sampling-based, and search-based baselines, including ReST-MCTS. On the default backbone, it achieves 92.8 EM on MATH, 90.4 accuracy on GPQA, 98.65 EM on GSM8K, 75.8 accuracy on BBEH, 85.6 accuracy on MMMU-Pro, and 48.8 accuracy on HLE, with gains over the strongest non-CoT2-Meta baseline of +3.6, +5.2, +1.15, +2.0, +4.3, and +4.3 points, respectively. Beyond these core results, the framework remains effective across a broader 15-benchmark suite spanning knowledge and QA, multi-hop reasoning, coding, and out-of-distribution evaluation. Additional analyses show better compute scaling, improved calibration, stronger selective prediction, targeted repair behavior, and consistent gains across backbone families. These results suggest that explicit metacognitive control is a practical design principle for reliable and compute-efficient test-time reasoning systems.
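The budgeted control loop described above can be made concrete with a small sketch. Everything here is illustrative: the names (`meta_control`, `oracle_score`, `expand`, `repair`), the toy "good"/"bad" step oracle, and the thresholds are stand-in assumptions, not the paper's actual implementation; the point is only to show how a single budget counter can gate expansion, pruning, repair, stopping, and fallback/abstention.

```python
import heapq
from dataclasses import dataclass, field

# Hypothetical sketch of a budgeted metacognitive control loop in the spirit
# of CoT2-Meta. All components below are toy stand-ins, not the paper's API.

@dataclass(order=True)
class Node:
    neg_score: float               # heapq is a min-heap, so store -score
    steps: tuple = field(compare=False)

def oracle_score(steps):
    # Stand-in process oracle: fraction of steps judged sound.
    return sum(1 for s in steps if s == "good") / max(len(steps), 1)

def expand(steps):
    # Stand-in strategy-conditioned generator: propose two child trajectories.
    return [steps + ("good",), steps + ("bad",)]

def repair(steps):
    # Targeted repair: rewrite the most recent flawed step.
    return steps[:-1] + ("good",)

def meta_control(budget, prune_tau=0.4, stop_tau=0.9, max_depth=4):
    frontier = [Node(0.0, ())]
    best = None
    while frontier and budget > 0:
        steps = heapq.heappop(frontier).steps
        score = oracle_score(steps) if steps else 0.0
        if steps and score < prune_tau:
            # Low oracle score: spend one budget unit on repair instead of
            # discarding the trajectory outright.
            steps, budget = repair(steps), budget - 1
            score = oracle_score(steps)
        if score >= stop_tau and len(steps) >= max_depth:
            return steps           # Stop: validated complete trajectory.
        if best is None or score > oracle_score(best):
            best = steps
        if len(steps) < max_depth:
            for child in expand(steps):
                budget -= 1        # Each generated child costs one unit.
                heapq.heappush(frontier, Node(-oracle_score(child), child))
    # Fallback on budget exhaustion: best partial trajectory, or abstain
    # (None) if nothing cleared the pruning threshold.
    return best if best and oracle_score(best) >= prune_tau else None
```

With a generous budget the loop returns a fully validated trajectory; with a budget too small to grow the tree it abstains, which mirrors the abstention/fallback behavior the abstract describes.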

Paper Structure

This paper contains 76 sections, 21 equations, 11 figures, 25 tables.

Figures (11)

  • Figure 1: CoT2-Meta as reasoning over reasoning. The inner layer performs object-level reasoning, while the outer layer performs metacognitive control over partial trajectories. The meta-controller decides when to expand, prune, repair, stop, or abstain under a bounded budget.
  • Figure 2: CoT2-Meta pipeline. Strategy-conditioned generation forms a tree of partial trajectories, which are evaluated by an online process oracle using only the input and the current reasoning path. The meta-controller converts oracle signals into meta-states and allocates budget through expand, prune, repair, stop, and abstain actions. The output is either the best validated trajectory or an abstention/fallback decision, together with structured decision traces.
  • Figure 3: Compute-accuracy scaling on MATH. Accuracy versus total inference budget. CoT2-Meta is strongest in the low-budget regime and reaches the highest ceiling at larger budgets.
  • Figure 4: Calibration and selective prediction on MATH (C=16). (a) Selective accuracy versus coverage. (b) Reliability diagram. CoT2-Meta shows the best calibration and selective accuracy among the compared methods.
  • Figure 5: Decision-trace interpretability of CoT2-Meta on the curated hard subset of MATH. (a) Action distribution over search depth, showing early expansion, mid-stage pruning, and late repair/stop behavior. (b) Pruning audit across datasets, comparing prune precision, false-prune rate, and oracle gap. (c) Root-to-outcome flow on the MATH hard subset.
  • ...and 6 more figures