ReMA: Learning to Meta-think for LLMs with Multi-Agent Reinforcement Learning

Ziyu Wan, Yunxiang Li, Xiaoyu Wen, Yan Song, Hanjing Wang, Linyi Yang, Mark Schmidt, Jun Wang, Weinan Zhang, Shuyue Hu, Ying Wen

TL;DR

The paper addresses the challenge of enabling robust meta-thinking in LLMs by introducing ReMA, a two-agent MARL framework that decouples meta-thinking (high-level planning) from detailed reasoning (low-level execution). It formalizes both single-turn and multi-turn meta-thinking reasoning (MRP and MAMRP), with parameter-sharing strategies and turn-level ratio mechanisms to stabilize training. Empirical results on mathematical reasoning and LLM-as-a-Judge benchmarks show that ReMA consistently surpasses single-agent baselines, with strong out-of-distribution generalization, and that multi-turn extensions yield additional gains under careful hyperparameter control. Overall, the work demonstrates that structured, interactive agents guided by reinforcement learning can significantly enhance reasoning capability and generalization in LLMs, offering a scalable path for complex, long-horizon problems.

Abstract

Recent research on the reasoning of Large Language Models (LLMs) has sought to further enhance their performance by integrating meta-thinking -- enabling models to monitor, evaluate, and control their reasoning processes for more adaptive and effective problem-solving. However, current single-agent work lacks a specialized design for acquiring meta-thinking, resulting in low efficacy. To address this challenge, we introduce Reinforced Meta-thinking Agents (ReMA), a novel framework that leverages Multi-Agent Reinforcement Learning (MARL) to elicit meta-thinking behaviors, encouraging LLMs to think about thinking. ReMA decouples the reasoning process into two hierarchical agents: a high-level meta-thinking agent responsible for generating strategic oversight and plans, and a low-level reasoning agent responsible for detailed execution. Through iterative reinforcement learning with aligned objectives, these agents explore and learn to collaborate, leading to improved generalization and robustness. Empirical results from single-turn experiments demonstrate that ReMA outperforms single-agent RL baselines on complex reasoning tasks, including competition-level mathematical benchmarks and LLM-as-a-Judge benchmarks. We further extend ReMA to multi-turn interaction settings, leveraging a turn-level ratio and parameter sharing to improve efficiency. Comprehensive ablation studies illustrate the evolving dynamics of each agent, providing valuable insights into how the meta-thinking reasoning process enhances the reasoning capabilities of LLMs. Our code is available at https://github.com/ziyuwan/ReMA-public

Paper Structure

This paper contains 66 sections, 29 equations, 12 figures, 2 tables, 1 algorithm.

Figures (12)

  • Figure 1: Left: A construction-based method that fine-tunes LLMs using rejection sampling, searching among combinations of pre-defined templates. Middle: An R1-like method that learns to mix meta-thinking and detailed reasoning steps during training. Right: Our method, ReMA, which separates the meta-thinking and reasoning steps into a multi-agent system and updates the agents with reinforcement learning.
  • Figure 2: Comparison of training pipelines. Left: RL training of VRP and MRP, where a single LM agent is updated either with mixed (VRP) or explicit (MRP) meta-thinking. Middle: ReMA with separate parameters for the high-level (meta-thinking) and low-level (reasoning) agents; training alternates between freezing one agent and updating the other. Right: ReMA with shared parameters and multi-turn interactions: both agents share the same parameters and are distinguished by their system prompts. Training employs a turn-level ratio for stable multi-turn reinforcement learning and efficient updates, ensuring each turn’s contribution is controlled to prevent instability.
  • Figure 3: An RL experiment with 3 training schemes. While RL from SFT excels on easier problems, RL under Meta-thinking shows superior generalization to harder problems like AIME24.
  • Figure 4: Average problem difficulty by action type during training. Left: 1B LM collapses to the EMPTY action. Right: 8B LM adapts to a more complex meta-thinking strategy for harder problems.
  • Figure 5: Training results of multi-turn ReMA on MATH-Level3-5-8K under different rollout configurations.
  • ...and 7 more figures
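
The Figure 2 caption mentions a turn-level ratio that keeps each turn's contribution under control during multi-turn training. The summary above does not give its formula, so the following is only a minimal sketch of one plausible reading: each turn gets a single importance ratio (assumed here to be the exponential of the mean token log-probability difference between the new and old policy), which is then clipped PPO-style. The function names, the averaging choice, and the `eps` value are illustrative assumptions, not the authors' implementation.

```python
import math
from typing import List


def turn_level_ratios(
    new_logps: List[List[float]], old_logps: List[List[float]]
) -> List[float]:
    """Return one importance ratio per turn.

    Assumption: the ratio of a turn is exp(mean over its tokens of
    new_logp - old_logp), i.e. the geometric mean of token-level ratios.
    """
    ratios = []
    for new_turn, old_turn in zip(new_logps, old_logps):
        if not new_turn or len(new_turn) != len(old_turn):
            raise ValueError("each turn needs matching, non-empty log-prob lists")
        mean_diff = sum(n - o for n, o in zip(new_turn, old_turn)) / len(new_turn)
        ratios.append(math.exp(mean_diff))
    return ratios


def clipped_turn_objective(
    ratios: List[float], advantages: List[float], eps: float = 0.2
) -> float:
    """PPO-style clipped surrogate, applied per turn instead of per token."""
    if not ratios or len(ratios) != len(advantages):
        raise ValueError("need one advantage per turn")
    total = 0.0
    for ratio, adv in zip(ratios, advantages):
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
        # Take the pessimistic value so a single turn cannot dominate the update.
        total += min(ratio * adv, clipped * adv)
    return total / len(ratios)


if __name__ == "__main__":
    # Two turns of a hypothetical rollout: token log-probs under new and old policy.
    new = [[-1.0, -0.5], [-2.0, -2.0, -2.0]]
    old = [[-1.0, -0.5], [-2.5, -2.5, -2.5]]
    r = turn_level_ratios(new, old)
    print(r, clipped_turn_objective(r, advantages=[1.0, 1.0]))
```

Averaging log-probability differences within a turn makes the ratio independent of turn length, which is one way long turns could be prevented from producing extreme ratios; whether ReMA uses exactly this normalization should be checked against the paper and its released code.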