RLKD: Distilling LLMs' Reasoning via Reinforcement Learning

Shicheng Xu, Liang Pang, Yunchang Zhu, Jia Gu, Zihao Wei, Jingcheng Deng, Feiyang Pan, Huawei Shen, Xueqi Cheng

TL;DR

This work tackles the mismatch between SFT-based reasoning distillation, which copies only the teacher's flat output path, and the implicit multi-branch structure of authentic reasoning in LLMs. It introduces RLKD, a reinforcement-learning-based distillation framework guided by a Generative Structure Reward Model (GSRM) that decomposes reasoning paths into meta-reasoning and solving steps and provides step-level structural rewards. Optimized with GRPO, RLKD trains students to internalize the teacher's implicit reasoning branches rather than merely mimic surface tokens, and it yields greater path diversity than SFT baselines. Experiments on math and graduate-level QA show RLKD surpassing standard SFT-RL pipelines even when trained on 0.1% of the data under an RL-only regime, suggesting a practical pathway to more capable, data-efficient reasoning in smaller LLMs. Code is available at https://github.com/xsc1234/RLKD.

Abstract

Distilling reasoning paths from teacher to student models via supervised fine-tuning (SFT) provides a shortcut for improving the reasoning ability of smaller Large Language Models (LLMs). However, the reasoning paths generated by teacher models often reflect only surface-level traces of their underlying authentic reasoning. Insights from cognitive neuroscience suggest that authentic reasoning involves a complex interweaving between meta-reasoning (which selects appropriate sub-problems from multiple candidates) and solving (which addresses the sub-problem). This implies authentic reasoning has an implicit multi-branch structure. Supervised fine-tuning collapses this rich structure into a flat sequence of token prediction in the teacher's reasoning path, preventing effective distillation of this structure to students. To address this limitation, we propose RLKD, a reinforcement learning (RL)-based distillation framework guided by a novel Generative Structure Reward Model (GSRM). Our GSRM converts reasoning paths into multiple meta-reasoning-solving steps and computes rewards to measure structural alignment between student and teacher reasoning. RLKD combines this reward with RL, enabling student LLMs to internalize the teacher's implicit multi-branch reasoning structure rather than merely mimicking fixed output paths. Experiments show RLKD surpasses standard SFT-RL pipelines even when trained on 0.1% of data under an RL-only regime, unlocking greater student reasoning potential than SFT-based distillation. Code is available at https://github.com/xsc1234/RLKD.
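
To make the idea of a structural reward more concrete, the following is a minimal sketch, not the authors' implementation. It assumes reasoning paths have already been converted into ordered (meta-reasoning, solving) steps, and it replaces the generative reward model with a simple word-overlap check. The names `Step`, `token_overlap`, `structure_reward`, and `group_relative_advantages`, the 0.5 threshold, the stop-at-first-mismatch rule, and the use of the population standard deviation are all illustrative assumptions. The last function shows the group-relative normalization commonly associated with GRPO-style training.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass(frozen=True)
class Step:
    """One hypothetical meta-reasoning-solving step of a reasoning path."""
    meta: str   # which sub-problem was selected (meta-reasoning)
    solve: str  # how that sub-problem was addressed (solving)


def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of lowercase word sets.

    Assumption: a crude stand-in for the learned generative judge.
    """
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def structure_reward(student, teacher, threshold=0.5):
    """Fraction of teacher steps the student matches, in order.

    Assumption: matching stops at the first step whose meta-reasoning
    or solving part falls below the overlap threshold.
    """
    if not teacher:
        return 0.0
    matched = 0
    for s, t in zip(student, teacher):
        meta_ok = token_overlap(s.meta, t.meta) >= threshold
        solve_ok = token_overlap(s.solve, t.solve) >= threshold
        if not (meta_ok and solve_ok):
            break
        matched += 1
    return matched / len(teacher)


def group_relative_advantages(rewards):
    """Normalize rewards within one sampled group: (r - mean) / std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        # All samples scored the same, so none is preferred.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


if __name__ == "__main__":
    teacher = [
        Step("isolate x", "subtract 3 from both sides"),
        Step("solve for x", "divide both sides by 2"),
    ]
    samples = [
        teacher,                                              # follows both steps
        [teacher[0], Step("guess the answer", "try 7")],      # diverges at step 2
        [Step("guess the answer", "try 7")],                  # diverges at step 1
    ]
    rewards = [structure_reward(s, teacher) for s in samples]
    print(rewards)
    print(group_relative_advantages(rewards))
```

In this toy setup a student sample earns more reward the longer it stays aligned with the teacher's sequence of sub-problem choices, so the signal depends on step-level structure rather than on token-by-token agreement with a single fixed path.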

Paper Structure

This paper contains 24 sections, 6 equations, 20 figures, 1 table, 1 algorithm.

Figures (20)

  • Figure 1: (a) The generated reasoning path has implicit multi-branch structure. (b) Distillation only based on SFT collapses the rich structure into a flat sequence of token prediction to memorize only the teacher's generated path. (c) Our proposed RL-based distillation can teach the student LLM to learn this structure by using a Generative Structure Reward Model to measure the alignment between the reasoning structure of the student and teacher, serving as the reward in RL.
  • Figure 2: One sequence generation example for a math task, used in the in-context learning prompts for GPT-4o.
  • Figure 3: Accuracy
  • Figure 4: Completion Length
  • Figure 5: Alignment of the Structures
  • ...and 15 more figures