rSIM: Incentivizing Reasoning Capabilities of LLMs via Reinforced Strategy Injection

Sijia Chen, Baochun Li, Di Niu

TL;DR

rSIM is a reinforced strategy injection mechanism that decouples planning from reasoning: a small planner is trained jointly with the reasoner under a leader-follower multi-agent RL (MARL) framework, with the aim of turning any LLM into a reasoning language model (RLM). The planner adaptively injects strategies from a predefined set into the chain of thought, yielding significant accuracy gains across diverse tasks, and the trained planner can be reused as a plug-in for other models. Because the planner is task-agnostic, it can be refined continually, and the paper reports cross-model generalization as well as sustained gains on coding tasks with a continually refined planner. Overall, rSIM offers a scalable, plug-in-friendly path to stronger reasoning in both small and large LLMs without extensive retraining of the base models.

Abstract

Large language models (LLMs) are post-trained through reinforcement learning (RL) to evolve into Reasoning Language Models (RLMs), where the hallmark of this advanced reasoning is "aha" moments when they start to perform strategies, such as self-reflection and deep thinking, within chains of thought (CoTs). Motivated by this, this paper proposes a novel reinforced strategy injection mechanism (rSIM) that enables any LLM to become an RLM by employing a small planner to guide the LLM's CoT through the adaptive injection of reasoning strategies. To achieve this, the planner (leader agent) is jointly trained with an LLM (follower agent) using multi-agent RL (MARL), based on a leader-follower framework and straightforward rule-based rewards. Experimental results show that rSIM enables Qwen2.5-0.5B to become an RLM and significantly outperform Qwen2.5-14B. Moreover, the planner is generalizable: it only needs to be trained once and can be applied as a plug-in to substantially improve the reasoning capabilities of existing LLMs. In addition, the planner supports continual learning across various tasks, allowing its planning abilities to gradually improve and generalize to a wider range of problems.
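
To make the planner/reasoner split in the abstract concrete, the following is a minimal inference-time sketch, not the authors' implementation. It assumes the planner picks one strategy per step and the reasoner then writes the next step under that strategy. The names `run_with_planner`, `toy_planner`, `toy_reasoner`, and the `FINAL ANSWER` stop marker are hypothetical, and the strategy list holds only the two strategies the abstract names rather than the full set used by rSIM. Training (MARL, rewards) is not covered.

```python
from typing import Callable, List

# Only these two strategies are named in the abstract; the full set used by
# rSIM is not listed here, so this list is an illustrative placeholder.
STRATEGIES = ["self-reflection", "deep thinking"]

# Hypothetical interfaces: a planner maps (question, steps so far) to a
# strategy index; a reasoner maps (question, steps so far, strategy) to text.
Planner = Callable[[str, List[str]], int]
Reasoner = Callable[[str, List[str], str], str]


def run_with_planner(
    question: str,
    planner: Planner,
    reasoner: Reasoner,
    max_steps: int = 8,
    stop_marker: str = "FINAL ANSWER",  # assumed stop convention
) -> List[str]:
    """Alternate planner and reasoner until an answer appears or steps run out."""
    steps: List[str] = []
    for _ in range(max_steps):
        # Leader: choose which strategy to inject before the next step.
        strategy = STRATEGIES[planner(question, steps)]
        # Follower: generate the next reasoning step under that strategy.
        step = reasoner(question, steps, strategy)
        steps.append(f"[{strategy}] {step}")
        if stop_marker in step:
            break
    return steps


# Toy stand-ins so the loop can be run without any model.
def toy_planner(question: str, steps: List[str]) -> int:
    return len(steps) % len(STRATEGIES)  # simply alternate strategies


def toy_reasoner(question: str, steps: List[str], strategy: str) -> str:
    if len(steps) >= 2:
        return "FINAL ANSWER: 4"
    return f"thinking about {question!r}"


trace = run_with_planner("What is 2 + 2?", toy_planner, toy_reasoner)
```

In a real setting, `planner` and `reasoner` would wrap calls to the two language models; keeping them as plain callables is what lets a trained planner be swapped in front of a different reasoner, which is the plug-in use the abstract describes.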

Paper Structure

This paper contains 23 sections, 2 equations, 6 figures, 5 tables, 2 algorithms.

Figures (6)

  • Figure 1: (a) Shows the absence of the "aha" moment in Qwen2.5-0.5B on the MATH dataset [MATH-arxiv21]. (b) Compares the performance of LLMs before and after training with GRPO. The average number of reasoning strategies used in answering each question (the strategies presented in Figure 2) is counted by detecting keywords.
  • Figure 2: Illustration of the cooperative pipeline between the rSIM planner (leader) and the LLM (reasoner/follower). The planner receives the question and the current reasoning steps, and selects one of nine strategies to inject into the reasoning process to guide the reasoner in generating the next step. This demo is based on a question from the MATH dataset.
  • Figure 3: Accuracy of Qwen2.5 models (0.5B, 1.5B, 7B) on the MATH dataset under various settings. "+GRPO" indicates training with GRPO [GRPO-arxiv24]. "+rSIM" denotes joint training with a Qwen2.5 planner, with the number indicating planner size. "+Planner" refers to using trained planners (0.5B, 7B) in plugin mode, derived from the MATH dataset (see Figure 5). Qwen2.5-14B is shown as a baseline with gray horizontal lines.
  • Figure 4: Illustration of accuracy for models of various sizes across different datasets, using the trained rSIM planner as a plugin. The terms with "+Planner" indicate that the base model collaborates with a trained planner during reasoning. The trained planners are derived from the MATH dataset, as shown in Figure 5.
  • Figure 5: Training and evaluation (Eval) curves of the rSIM on the MATH dataset, using either the Qwen2.5-0.5B or Qwen2.5-7B model as the planner, paired with the Qwen2.5-0.5B model as the reasoner.
  • ...and 1 more figure
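
The Figure 1 caption mentions counting reasoning strategies per answer by detecting keywords. A rough sketch of what such a count could look like follows. The keyword lists are hypothetical (the actual keywords are not given here), only the two strategies named in the abstract are included, and counting each strategy at most once per response is an assumption rather than the paper's stated procedure.

```python
from typing import Dict, List, Set

# Hypothetical keyword lists for illustration only.
STRATEGY_KEYWORDS: Dict[str, List[str]] = {
    "self-reflection": ["wait", "let me check", "double-check"],
    "deep thinking": ["let me think", "think carefully"],
}


def strategies_used(response: str) -> Set[str]:
    """Return the strategies whose keywords occur in the response (case-insensitive)."""
    text = response.lower()
    return {
        name
        for name, keywords in STRATEGY_KEYWORDS.items()
        if any(keyword in text for keyword in keywords)
    }


def average_strategy_count(responses: List[str]) -> float:
    """Average number of distinct strategies detected per response."""
    if not responses:
        return 0.0
    return sum(len(strategies_used(r)) for r in responses) / len(responses)


responses = [
    "Wait, let me check that again.",
    "The answer is 4.",
    "Let me think carefully. Wait, that is wrong.",
]
average = average_strategy_count(responses)
```

Plain substring matching is crude (for example, "wait" also matches inside "await"), so a more careful version would match on word boundaries.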