Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning

Yihe Deng, I-Hung Hsu, Jun Yan, Zifeng Wang, Rujun Han, Gufeng Zhang, Yanfei Chen, Wei Wang, Tomas Pfister, Chen-Yu Lee

TL;DR

This work introduces Supervised Reinforcement Learning (SRL), a framework that reframes difficult reasoning tasks as sequential actions paired with an internal monologue. It provides dense, step-wise rewards based on the similarity between the model’s actions and expert actions, enabling learning even when final outcomes are incorrect. Empirical results show SRL outperforms SFT and RLVR on challenging math reasoning benchmarks and extends effectively to software engineering agentic tasks, with the SRL→RLVR pipeline delivering the strongest gains. The approach offers a robust, generalizable curriculum for reasoning-oriented LLMs and demonstrates flexible planning and self-verification behaviors beyond mere output length increases.

Abstract

Large Language Models (LLMs) often struggle with problems that require multi-step reasoning. For small-scale open-source models, Reinforcement Learning with Verifiable Rewards (RLVR) fails when correct solutions are rarely sampled even after many attempts, while Supervised Fine-Tuning (SFT) tends to overfit long demonstrations through rigid token-by-token imitation. To address this gap, we propose Supervised Reinforcement Learning (SRL), a framework that reformulates problem solving as generating a sequence of logical "actions". SRL trains the model to generate an internal reasoning monologue before committing to each action. It provides smoother rewards based on the similarity between the model's actions and expert actions extracted from the SFT dataset in a step-wise manner. This supervision offers richer learning signals even when all rollouts are incorrect, while encouraging flexible reasoning guided by expert demonstrations. As a result, SRL enables small models to learn challenging problems previously unlearnable by SFT or RLVR. Moreover, initializing training with SRL before refining with RLVR yields the strongest overall performance. Beyond reasoning benchmarks, SRL generalizes effectively to agentic software engineering tasks, establishing it as a robust and versatile training framework for reasoning-oriented LLMs.
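
The step-wise similarity reward described above can be illustrated with a minimal sketch. The abstract does not name the similarity metric, so this example assumes Python's `difflib.SequenceMatcher` ratio as a stand-in. The function names `step_reward` and `rollout_rewards` are hypothetical and not taken from the paper.

```python
from difflib import SequenceMatcher
from typing import List


def step_reward(model_action: str, expert_action: str) -> float:
    """Return a similarity score in [0, 1] between two action strings.

    Assumption: difflib's ratio (2 * matched_chars / total_chars) stands in
    for the similarity metric, which the abstract does not specify.
    """
    return SequenceMatcher(None, model_action, expert_action).ratio()


def rollout_rewards(model_actions: List[str], expert_action: str) -> List[float]:
    """Score every sampled action for one step against the same expert action."""
    return [step_reward(action, expert_action) for action in model_actions]
```

Under this assumption, two rollouts that both miss the expert action can still receive different scores, whereas a reward based only on final-answer correctness would give both a zero. This is one way to read the claim that the supervision remains informative when all rollouts are incorrect.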

Paper Structure

This paper contains 19 sections, 13 equations, 5 figures, 7 tables.

Figures (5)

  • Figure 1: Performance of our method (SRL) against baselines on math reasoning benchmarks, with all models trained on the challenging s1k dataset (Muennighoff et al., 2025). Our key observations are: (1) Directly applying SFT on this dataset leads to performance degradation compared to the base model. (2) While RLVR can improve generalization over SFT, the gains are marginal. (3) Our proposed SRL method substantially outperforms these baselines, and the SRL $\rightarrow$ RLVR pipeline achieves the highest performance, overcoming the challenges of training on difficult data.
  • Figure 2: Illustration of SRL as compared to RL(VR) and SFT. (a) RL(VR) takes a query as input and performs k rollouts. The final answer correctness is used as the reward. (b) SFT uses both a query $\mathbf{x}$ and a complete teacher response $\mathbf{y}$ as input, training with a per-token loss to maximize the probability $p(\mathbf{y}|\mathbf{x})$. (c) SRL also uses a query and a teacher response. It breaks the response into step actions and, at each step, uses the previous steps as context. The model generates a next step action along with its step-wise inner thoughts, and the reward $r_k$ is based on the similarity between the model's and the teacher's action.
  • Figure 3: Given a solution trajectory, we take each summarized step as an action to be learned and take the partial solution before the step as the context of our newly created data. The model is then prompted to generate its thinking process followed by the action for the current step. A reward ($r_2$ in the figure) is then calculated based on the similarity between the model's and the expert's action.
  • Figure 4: Reasoning length distribution for the base model and the model trained with SRL.
  • Figure 5: Illustration of applying SRL to SWE tasks. We keep the two most recent rounds of expert actions and their corresponding observations in context and prompt the LLM to think before producing the next action. The model's action is then compared with the expert action to compute the sequence similarity reward.
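
The data construction described in the captions for Figures 2(c) and 3 can be sketched as follows: each step of an expert trajectory becomes one training example, with the preceding steps as context. This is a minimal illustration, assuming the expert solution has already been split into a list of step strings. The `StepExample` class and `build_step_examples` function are hypothetical names, not the paper's.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StepExample:
    query: str            # the original problem statement
    context: List[str]    # expert steps that precede the current step
    expert_action: str    # the expert step the model's action is compared to


def build_step_examples(query: str, expert_steps: List[str]) -> List[StepExample]:
    """Turn one expert trajectory with N steps into N step-wise examples.

    Example k uses steps 0..k-1 as context and step k as the target action.
    """
    return [
        StepExample(
            query=query,
            context=expert_steps[:k],
            expert_action=expert_steps[k],
        )
        for k in range(len(expert_steps))
    ]
```

For a trajectory with three steps, this yields three examples. The first has an empty context, and the last has the first two steps as context and the third step as the target action.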