Meta-Awareness Enhances Reasoning Models: Self-Alignment Reinforcement Learning

Yoonjeon Kim, Doohyuk Jang, Eunho Yang

TL;DR

MASA introduces meta-awareness into reasoning models by training a meta-prediction pathway in parallel with solution reasoning, using self-alignment rewards to align meta-information with actual rollout statistics. The framework includes MASA-efficient variants with expert-trajectory SFT, predictive gating, and early cutoffs to boost training efficiency without external data. Empirically, MASA improves in-domain math benchmarks and enhances out-of-domain generalization, achieving up to 19.3% gains on AIME25 and over 1.28x training speedups. The work demonstrates that meta-prediction is a principled lever for boosting reasoning performance and generalization in large language models, with potential for broader meta-cognitive strategies.

Abstract

Recent studies on reasoning models explore the meta-awareness of language models: the ability of a model to know how to think on its own. We argue that large reasoning models lack this property, showing severe misalignment between true rollouts and predicted meta-information. We posit that aligning meta-predictions with true rollouts will lead to significant performance gains. To verify this hypothesis, we design a training pipeline that boosts Meta-Awareness via Self-Alignment (MASA), and show that enhanced meta-awareness directly translates to improved accuracy. Unlike existing meta-cognitive reasoning models, our method does not require external training sources but leverages self-generated signals to train meta-awareness. Moreover, our method enables efficient training by i) filtering out zero-variance prompts that are either trivial or unsolvable and ii) cutting off lengthy rollouts when they are unlikely to lead to correct answers. The results are encouraging: our strategy yields significant improvements in both accuracy and training efficiency on in-domain tasks and shows strong generalization to out-of-domain benchmarks. More specifically, our method speeds up GRPO training by over 1.28x to reach the same performance, and achieves a 19.3% gain in accuracy on AIME25 and a 6.2% average gain over six mathematics benchmarks. Training with meta-cognitive guidance also enhances out-of-domain generalization, giving a 3.87% boost on GPQA-Diamond and a 2.08% overall accuracy gain across 13 benchmarks spanning logical, scientific, and coding domains.
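To make the term "zero-variance prompt" concrete, the following is a minimal sketch, not the authors' implementation. It assumes binary correctness rewards and GRPO-style group normalization using the population standard deviation; the function names (`group_advantages`, `is_zero_variance`, `filter_prompts`) and the example data are hypothetical. When every rollout for a prompt receives the same reward, all normalized advantages are zero, so the prompt contributes no learning signal. The sketch checks this after rollouts are collected, whereas the abstract describes filtering such prompts out.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages in the GRPO style (assumed form).

    Each reward is centered by the group mean and scaled by the
    group standard deviation; eps avoids division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def is_zero_variance(rewards):
    """True when every rollout in the group received the same reward."""
    return len(set(rewards)) <= 1


def filter_prompts(groups):
    """Keep only prompts whose rollout rewards differ within the group."""
    return {p: r for p, r in groups.items() if not is_zero_variance(r)}


# Hypothetical rollout rewards (1.0 = correct, 0.0 = incorrect).
groups = {
    "trivial": [1.0, 1.0, 1.0, 1.0],      # always solved
    "unsolvable": [0.0, 0.0, 0.0, 0.0],   # never solved
    "informative": [1.0, 0.0, 0.0, 1.0],  # mixed outcomes
}

kept = filter_prompts(groups)
trivial_adv = group_advantages(groups["trivial"])
informative_adv = group_advantages(groups["informative"])
```

In this example only the "informative" prompt survives the filter, since the other two groups yield all-zero advantages. Skipping such groups saves the generation cost of rollouts that would not change the policy.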

Paper Structure

This paper contains 25 sections, 6 equations, 7 figures, 5 tables, 1 algorithm.

Figures (7)

  • Figure 1: (a) Existing large reasoning models lack meta-awareness. (b) MASA significantly improves meta-awareness, as shown by the alignment between meta-predictions and the actual rollout statistics (difficulty and length). (c) Training step has limited impact on accuracy. (d) Meta-awareness directly translates to increased accuracy.
  • Figure 2: Overall framework of MASA. (a) Parallel rollout of the meta-prediction path and the solution path. Meta-predictions are rewarded by self-alignment with statistics collected from solution rollouts. (b) Meta-based predictive gating, early cutoff, and notion hinting from meta-predictions.
  • Figure 3: (a) Notion score of positive / negative notions from an earlier training step. (b) Precision score of predictive gating on true zero-variance prompts. (c) Precision score of early cutoff on true incorrect rollouts. Precisions are smoothed by a moving average over 5 steps.
  • Figure 4: Comparison of MASA-efficient and GRPO under the same training budgets: number of seen training tasks, total generation tokens, and training time. Accuracy is calculated as the average of AIME'24, AIME'25, and AMC'23. All accuracy curves are smoothed with a 3-window moving average.
  • Figure 5: Analysis on Gating.
  • ...and 2 more figures