A Simple "Motivation" Can Enhance Reinforcement Finetuning of Large Reasoning Models

Junjie Zhang, Guozheng Ma, Shunyu Liu, Haoyu Wang, Jiaxing Huang, Ting-En Lin, Fei Huang, Yongbin Li, Dacheng Tao

TL;DR

This paper introduces MeRF, an intuitive yet effective method that enhances reinforcement finetuning of LLMs by telling them the rules of the game: it injects the reward specification directly into the prompt, where it serves as an in-context motivation that makes the model aware of the optimization objective.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful learn-to-reason paradigm for large reasoning models to tackle complex tasks. However, the current RLVR paradigm is still inefficient because it works by trial and error. To improve, the model must explore the reward space by generating many responses and learn from fragmented reward signals while remaining blind to the overall reward pattern. Fortunately, verifiable rewards make it possible to describe the reward function in natural language, and LLMs have demonstrated strong in-context learning ability. This motivates us to explore whether large reasoning models can benefit from a motivation for the task, i.e., awareness of the reward function, during reinforcement finetuning, as humans sometimes do when learning. In this paper, we introduce Motivation-enhanced Reinforcement Finetuning (MeRF), an intuitive yet effective method that enhances reinforcement finetuning of LLMs by "telling LLMs the rules of the game". Specifically, MeRF directly injects the reward specification into the prompt, where it serves as an in-context motivation that makes the model aware of the optimization objective. This simple modification leverages the in-context learning ability of LLMs and aligns generation with optimization, thereby incentivizing the model to generate desired outputs through both inner motivation and external reward. Empirical evaluations demonstrate that MeRF achieves substantial performance gains over the RLVR baseline. Moreover, ablation studies show that MeRF performs better when the in-context motivation is more consistent with the external reward function, while the model also demonstrates an ability to adapt to misleading motivations through reinforcement finetuning.
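The core idea, describing a verifiable reward in natural language and placing that description in the prompt, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the rubric wording, the point values, the `<answer>` tag format, and the helper names `build_motivation`, `build_prompt`, and `verifiable_reward` are hypothetical and are not taken from the paper.

```python
import re

# Hypothetical reward rubric (assumption): rule names, wording, and point
# values are illustrative only and do not come from the paper.
REWARD_RULES = {
    "format": ("the final answer is wrapped in <answer>...</answer> tags", 1.0),
    "correct": ("the final answer matches the reference exactly", 2.0),
}

ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def build_motivation(rules):
    """Render the reward rules as a natural-language description."""
    lines = ["Your response will be scored by these rules:"]
    for description, points in rules.values():
        lines.append(f"- +{points:g} if {description}")
    return "\n".join(lines)


def build_prompt(question, with_motivation=True):
    """Build the training prompt, optionally appending the reward description."""
    parts = [question]
    if with_motivation:
        parts.append(build_motivation(REWARD_RULES))
    return "\n\n".join(parts)


def verifiable_reward(response, reference):
    """Score a response with the same rules that the motivation describes."""
    match = ANSWER_RE.search(response)
    if match is None:
        return 0.0  # no parsable answer, so no rule is satisfied
    reward = REWARD_RULES["format"][1]
    if match.group(1).strip() == reference.strip():
        reward += REWARD_RULES["correct"][1]
    return reward


if __name__ == "__main__":
    print(build_prompt("What is 6 * 7?"))
    print(verifiable_reward("<answer>42</answer>", "42"))
```

Because the prompt text and the scoring function are both derived from the same `REWARD_RULES` table, the in-context description stays consistent with the external reward, which is the condition the ablation studies associate with better performance. Calling `build_prompt(question, with_motivation=False)` gives the plain prompt used by a baseline without motivation.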

Paper Structure

This paper contains 19 sections, 3 equations, 14 figures, 2 tables.

Figures (14)

  • Figure 1: Validation accuracy of MeRF and the RLVR baseline on K&K Logic Puzzles during training. By simply "telling LLMs the rules of the game" with in-context motivation during RL training, MeRF significantly outperforms the RLVR baseline and improves faster, demonstrating the effectiveness of leveraging in-context motivation for more efficient RL training of LLMs.
  • Figure 2: (Left) Illustration of the RLVR pipeline and the in-context motivation introduced by MeRF. Whereas RLVR learns the reward patterns only indirectly, through generated reward samples and parameter updates, MeRF makes the model aware of the overall reward space through in-context motivation. (Right) We validate the base model, the RLVR model, and the MeRF model on the K&K Logic Puzzle dataset in two settings: with (w/) and without (w/o) motivation in the prompt. Unlike the base model, the RLVR model performs slightly better in validation w/ motivation than w/o motivation after RLVR training, even though the motivation is not involved in its training. This indicates a connection between validation with in-context motivation and RLVR training guided by the reward function that the motivation describes.
  • Figure 3: Pass@$k$ performance of MeRF and the RLVR baseline during training (from step 0 to step 280) on K&K Logic Puzzle. We compare pass@1, pass@2, pass@4, and pass@8 performance at each step; MeRF consistently outperforms the RLVR baseline on all metrics. More importantly, MeRF is markedly more training-efficient than the RLVR baseline: for example, at step 140 it already achieves better pass@4 and pass@8 performance than the final RLVR model (at step 280), whereas RLVR's pass@4 and pass@8 performance hardly improves after step 140.
  • Figure 4: Comparison of pass@8 and pass@1 performance of MeRF and the RLVR baseline on the MATH500 dataset during training. MeRF consistently outperforms the RLVR baseline on both pass@8 and pass@1, while RLVR's pass@8 performance hardly improves after step 80, demonstrating the effectiveness of MeRF in improving the math reasoning capabilities of LLMs.
  • Figure 5: Ratio of correct generated answers on the training set during training on the K&K Logic Puzzle dataset. MeRF consistently outperforms the RLVR baseline, demonstrating the better exploration ability that the in-context motivation encourages as the model seeks the best reward during training.
  • ...and 9 more figures
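
The pass@$k$ metric reported in Figures 3 and 4 can be computed in several ways, and the captions do not say which one the authors use. As a hedged illustration only, the sketch below implements the widely used unbiased estimator, which draws $n$ samples per problem, counts the $c$ correct ones, and estimates pass@$k$ as $1 - \binom{n-c}{k} / \binom{n}{k}$. The function name `pass_at_k` and the example numbers are assumptions.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples of which c are correct.

    Assumption: this is the standard combinatorial estimator, not
    necessarily the exact procedure used for the figures above.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect,
    # so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical example: 8 samples for one problem, 2 of them correct.
    for k in (1, 2, 4, 8):
        print(k, pass_at_k(8, 2, k))
```

Averaging this per-problem estimate over a validation set gives one curve point per training step; with $k = 1$ it reduces to the plain fraction of correct samples.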