Reinforcement Learning Teachers of Test Time Scaling

Edoardo Cetin, Tianyu Zhao, Yujin Tang

TL;DR

The paper tackles the exploration problem in RL for open-ended reasoning by introducing Reinforcement-Learned Teachers (RLTs): small 7B models that are prompted with both a question and its solution and produce instructional explanations for downstream distillation. RLTs are trained with dense rewards that combine the student's probability of the solution given the explanation with a KL-regularization term, aligning the teacher's explanations with the student's perspective. Empirically, distilling from a 7B RLT outperforms distillation pipelines built on much larger models on AIME, MATH, and GPQA, both for same-size students and for larger 32B students. RLTs also provide better cold-start data for RL and transfer zero-shot to out-of-domain tasks such as countdown. By shifting the emphasis from solving problems from scratch to producing reusable explanations, the framework reduces the cost and complexity of RL-based reasoning.

Abstract

Training reasoning language models (LMs) with reinforcement learning (RL) for one-hot correctness inherently relies on the LM being able to explore and solve its task with some chance at initialization. Furthermore, a key use case of reasoning LMs is to act as teachers for distilling new students and cold-starting future RL iterations rather than being deployed themselves. From these considerations, we introduce a new framework that avoids RL's exploration challenge by training a new class of Reinforcement-Learned Teachers (RLTs) focused on yielding the most effective downstream distillation. RLTs are prompted with both the question and solution to each problem, and tasked to simply "connect-the-dots" with detailed explanations tailored for their students. We train RLTs with dense rewards obtained by feeding each explanation to the student and testing its understanding of the problem's solution. In practice, the raw outputs of a 7B RLT provide higher final performance on competition and graduate-level tasks than existing distillation and cold-starting pipelines that collect and postprocess the reasoning traces of orders of magnitude larger LMs. Furthermore, RLTs maintain their effectiveness when training larger students and when applied zero-shot to out-of-distribution tasks, unlocking new levels of efficiency and re-usability for the RL reasoning framework. Code available at: https://github.com/SakanaAI/RLT
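
The dense reward described above (a student-probability signal combined with a KL-regularization term) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the function names, the choice to aggregate per-token values as a mean plus a weighted extreme, and the weights `alpha` and `lam` are hypothetical and are not taken from the text above or from the released code.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for two discrete distributions over the same vocabulary.

    Assumes q is strictly positive wherever p is positive.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def rlt_reward(
    solution_logprobs: Sequence[float],
    explanation_kls: Sequence[float],
    alpha: float,
    lam: float,
) -> float:
    """Hypothetical dense reward for one teacher explanation.

    solution_logprobs: student log-probabilities of each solution token,
        computed with the teacher's explanation in the student's context.
    explanation_kls: per-token KL between the teacher's and the student's
        next-token distributions over the explanation tokens.
    alpha: weight on the worst-case token (illustrative assumption).
    lam: weight of the KL term relative to the solution term (assumption).
    """
    if not solution_logprobs or not explanation_kls:
        raise ValueError("both token sequences must be non-empty")

    # Student-understanding term: higher when the student finds the
    # solution likely after reading the explanation.
    mean_lp = sum(solution_logprobs) / len(solution_logprobs)
    r_ss = mean_lp + alpha * min(solution_logprobs)

    # Regularization term: higher when the explanation contains tokens
    # the student itself would find unlikely.
    mean_kl = sum(explanation_kls) / len(explanation_kls)
    r_kl = mean_kl + alpha * max(explanation_kls)

    return r_ss - lam * r_kl


# Usage: an explanation that makes the solution more likely scores higher.
clear = rlt_reward([-0.1, -0.2], [0.1, 0.1], alpha=0.5, lam=2.0)
vague = rlt_reward([-2.0, -3.0], [0.1, 0.1], alpha=0.5, lam=2.0)
```

Under these assumptions the reward is dense in the sense that every explanation receives a graded score, rather than a binary signal that is nonzero only when a problem is solved from scratch.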

Paper Structure

This paper contains 38 sections, 4 equations, 12 figures, 14 tables.

Figures (12)

  • Figure 1: RLTs provide better student distillation and RL cold-starts than orders of magnitude larger LMs across competition and graduate-level tasks (AIME, MATH, GPQA). This holds when distilling students of the same size (Left) and also 32B students, much larger than the RLT itself (Right).
  • Figure 2: Left: RL format asking an LM to think and solve hard problems from scratch. Right: RLT format asking an LM to produce instructive step-by-step explanations given access to the solutions.
  • Figure 3: The tokens from the RLT's explanations are copied into the student format to measure its understanding with our reward terms.
  • Figure 4: Left: Out-of-distribution performance transferring RLTs to produce new distillation data as compared to students trained on the reas_sd_sky_t1 corpus and direct RL on the countdown task. Right: Performance after training on different distillation datasets ranked by the RLT reward.
  • Figure 5: Examples comparing contents of the post-processed R1 traces (bespoke_stratos) that were particularly improved by the corresponding RLT explanations, as measured by our reward function.
  • ...and 7 more figures
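
The two formats described in the captions of Figures 2 and 3 (a teacher prompt that includes the solution, and a student format into which the explanation tokens are copied) might be sketched as below. The template strings, tag names, and function names are hypothetical placeholders chosen for illustration; the actual prompt formats are not reproduced in this text.

```python
from typing import NamedTuple


class StudentExample(NamedTuple):
    context: str  # question followed by the copied teacher explanation
    target: str   # solution text on which the student's likelihood is measured


def build_teacher_prompt(question: str, solution: str) -> str:
    """Teacher-side input: the teacher sees the question and its solution."""
    return (
        f"Question: {question}\n"
        f"Solution: {solution}\n"
        "Explain step by step how the solution is reached."
    )


def build_student_example(
    question: str, explanation: str, solution: str
) -> StudentExample:
    """Student-side input: the explanation is copied into the student's
    reasoning slot, and the solution is held out as the scoring target."""
    context = f"Question: {question}\n<think>\n{explanation}\n</think>\n"
    target = f"<solution>{solution}</solution>"
    return StudentExample(context=context, target=target)


# Usage: the same explanation text appears verbatim in the student context.
example = build_student_example("What is 2 + 2?", "Add two and two.", "4")
```

Splitting the student text into a context and a target keeps the scored span explicit: only the target tokens would contribute to a student-probability signal.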