Internalizing Meta-Experience into Memory for Guided Reinforcement Learning in Large Language Models

Shiting Huang, Zecheng Li, Yu Zeng, Qingnan Ren, Zhen Fang, Qisheng Su, Kou Shi, Lin Chen, Zehui Chen, Feng Zhao

TL;DR

This work tackles a meta-learning bottleneck in RLVR for large language models by introducing Meta-Experience Learning (MEL), which converts past reasoning failures into reusable, knowledge-level meta-experiences. Through contrastive analysis of correct versus incorrect trajectories, MEL locates the bifurcation points where reasoning diverges, diagnoses the underlying errors, and validates the resulting meta-experiences via replay before internalizing them into the model’s memory with a self-distillation objective. The resulting joint training blends trajectory-level exploration with dense, knowledge-level guidance, improving Pass@1, Avg@8, and Pass@8 across multiple math benchmarks and model scales, with larger gains as model size grows. MEL demonstrates the potential of internalizing cognitive mechanisms (error attribution and knowledge abstraction) into parametric memory to enhance intrinsic reasoning and to generalize across learning paradigms.
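
The summary above reports Pass@1, Avg@8, and Pass@8 without defining them. The following is a minimal sketch of how such metrics are commonly computed from per-sample correctness flags. It assumes Avg@k is the mean accuracy over k samples per problem and Pass@k is the fraction of problems with at least one correct sample among k. The function names and toy data are hypothetical, and the paper's exact evaluation protocol is not stated in this section.

```python
from typing import Sequence


def avg_at_k(correct: Sequence[Sequence[bool]]) -> float:
    """Mean accuracy over the k samples of each problem, averaged over problems.

    Assumed definition of Avg@k; each inner sequence holds k correctness flags.
    """
    per_problem = [sum(row) / len(row) for row in correct]
    return sum(per_problem) / len(per_problem)


def pass_at_k(correct: Sequence[Sequence[bool]]) -> float:
    """Fraction of problems solved by at least one of the k samples.

    Assumed definition of Pass@k; with k = 1 this reduces to plain accuracy.
    """
    return sum(any(row) for row in correct) / len(correct)


# Toy data (hypothetical): 3 problems, 4 sampled answers each.
results = [
    [True, False, False, True],
    [False, False, False, False],
    [True, True, True, True],
]

avg4 = avg_at_k(results)
pass4 = pass_at_k(results)
# Pass@1 here uses only the first sample of each problem.
pass1 = pass_at_k([row[:1] for row in results])
```

Under these definitions Avg@k rewards consistency across samples, while Pass@k only asks whether any sample succeeds, so the two can move independently.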

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an effective approach for enhancing the reasoning capabilities of Large Language Models (LLMs). Despite its efficacy, RLVR faces a meta-learning bottleneck: beyond practice and verification, it lacks the mechanisms for error attribution and experience internalization that are intrinsic to the human learning cycle, which limits fine-grained credit assignment and reusable knowledge formation. We term such reusable knowledge representations derived from past errors meta-experience. Based on this insight, we propose Meta-Experience Learning (MEL), a novel framework that incorporates self-distilled meta-experience into the model's parametric memory. Building upon standard RLVR, we introduce an additional design that leverages the LLM's self-verification capability to conduct contrastive analysis on paired correct and incorrect trajectories, identify the precise bifurcation points where reasoning errors arise, and summarize them into generalizable meta-experience. The meta-experience is further internalized into the LLM's parametric memory by minimizing the negative log-likelihood, which induces a language-modeled reward signal that bridges correct and incorrect reasoning trajectories and facilitates effective knowledge reuse. Experimental results demonstrate that MEL achieves consistent improvements on benchmarks, yielding Pass@1 gains of 3.92% to 4.73% across varying model sizes.
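
The two mechanical pieces named in the abstract, pairing correct with incorrect trajectories and minimizing a negative log-likelihood on meta-experience text, can be illustrated with a small sketch. Everything here is an assumption made for illustration: the `Rollout` type, the all-pairs strategy, the token-averaged NLL, and the `weight` that mixes the NLL term with an RL loss are hypothetical and are not taken from the paper. The LLM-based contrastive analysis that actually writes the meta-experience is not modeled.

```python
import math
from itertools import product
from typing import List, NamedTuple, Sequence, Tuple


class Rollout(NamedTuple):
    """One sampled trajectory and the outcome of the verifiable reward."""
    text: str
    is_correct: bool


def contrastive_pairs(group: Sequence[Rollout]) -> List[Tuple[Rollout, Rollout]]:
    """Pair every correct rollout with every incorrect one from the same prompt.

    All-pairs matching is an assumption; a group with only correct or only
    incorrect rollouts yields no pairs.
    """
    correct = [r for r in group if r.is_correct]
    incorrect = [r for r in group if not r.is_correct]
    return list(product(correct, incorrect))


def nll(token_logprobs: Sequence[float]) -> float:
    """Token-averaged negative log-likelihood of a target sequence."""
    return -sum(token_logprobs) / len(token_logprobs)


def joint_loss(rl_loss: float, me_token_logprobs: Sequence[float],
               weight: float = 1.0) -> float:
    """RL loss plus a weighted NLL term on meta-experience tokens.

    The additive form and the `weight` hyperparameter are assumptions.
    """
    return rl_loss + weight * nll(me_token_logprobs)


# Toy group (hypothetical): one correct and two incorrect rollouts.
group = [
    Rollout("x = 4, so the answer is 16", True),
    Rollout("x = 4, so the answer is 8", False),
    Rollout("x = 2, so the answer is 4", False),
]
pairs = contrastive_pairs(group)

# Log-probabilities the model assigns to two meta-experience tokens.
me_logprobs = [math.log(0.5), math.log(0.5)]
loss = joint_loss(rl_loss=0.2, me_token_logprobs=me_logprobs, weight=0.5)
```

In this sketch the NLL term supplies a dense, token-level signal alongside the sparse outcome reward, which is the role the abstract attributes to internalized meta-experience.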

Paper Structure

This paper contains 25 sections, 10 equations, 9 figures, and 1 table.

Figures (9)

  • Figure 1: Paradigm comparison between standard RLVR and MEL, where MEL extends RLVR with an explicit knowledge-level learning loop.
  • Figure 2: Overview of Meta-Experience Learning (MEL), which constructs meta-experiences from contrastive pairs via abstraction and validation, thereby introducing an explicit knowledge-level learning loop on top of standard RLVR.
  • Figure 3: Training curves comparing GRPO and MEL.
  • Figure 4: Case study comparing GRPO and MEL, with a visualization of meta-experience in the early stage.
  • Figure 5: Impact of meta-experience across different training methods, including Rejection Sampling Fine-Tuning (RFT) and REINFORCE++. ME denotes Meta-Experience.
  • ...and 4 more figures