Reinforcement Learning Teachers of Test Time Scaling
Edoardo Cetin, Tianyu Zhao, Yujin Tang
TL;DR
The paper tackles the exploration problem in RL for open-ended reasoning by introducing Reinforcement-Learned Teachers (RLTs): 7B models that are prompted with both a question and its solution and produce instructional explanations intended for downstream distillation. RLTs are trained with dense rewards that combine student-probability-based signals with a KL-regularization term, aligning the teacher's explanations with the student's learning perspective. Empirically, 7B RLTs outperform distillation pipelines built on much larger models on AIME, MATH, and GPQA, provide stronger cold-start data for RL, remain effective when distilling into students larger than the teacher, and transfer zero-shot to out-of-domain tasks such as countdown. The framework reduces the cost and complexity of RL-based reasoning by shifting the emphasis from solving problems from scratch to producing high-quality, reusable explanations for students across tasks and domains.
Abstract
Training reasoning language models (LMs) with reinforcement learning (RL) for one-hot correctness inherently relies on the LM being able to explore and solve its task with some chance at initialization. Furthermore, a key use case of reasoning LMs is to act as teachers for distilling new students and cold-starting future RL iterations rather than being deployed themselves. From these considerations, we introduce a new framework that avoids RL's exploration challenge by training a new class of Reinforcement-Learned Teachers (RLTs) focused on yielding the most effective downstream distillation. RLTs are prompted with both the question and solution to each problem, and tasked to simply "connect-the-dots" with detailed explanations tailored for their students. We train RLTs with dense rewards obtained by feeding each explanation to the student and testing its understanding of the problem's solution. In practice, the raw outputs of a 7B RLT provide higher final performance on competition and graduate-level tasks than existing distillation and cold-starting pipelines that collect and postprocess the reasoning traces of orders of magnitude larger LMs. Furthermore, RLTs maintain their effectiveness when training larger students and when applied zero-shot to out-of-distribution tasks, unlocking new levels of efficiency and re-usability for the RL reasoning framework. Code available at: https://github.com/SakanaAI/RLT
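As a rough illustration of the dense reward summarized above, the following Python sketch scores one explanation by (i) the mean log-probability the student assigns to the solution tokens after reading the explanation, minus (ii) a weighted KL term between teacher and student next-token distributions over the explanation tokens. This section does not give the exact formula, so the function names, the use of simple means, the direction of the KL term, and the `kl_weight` value are all assumptions made for illustration, not the authors' implementation.

```python
import math
from typing import Sequence


def solution_score(solution_logprobs: Sequence[float]) -> float:
    """Mean student log-probability of the solution tokens, given the
    question followed by the teacher's explanation (assumed signal)."""
    if not solution_logprobs:
        raise ValueError("solution_logprobs must be non-empty")
    return sum(solution_logprobs) / len(solution_logprobs)


def token_kl(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) between two distributions over the same vocabulary."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    # Clamp q to avoid division by zero; terms with p_i == 0 contribute 0.
    return sum(pi * math.log(pi / max(qi, 1e-12))
               for pi, qi in zip(p, q) if pi > 0.0)


def explanation_kl(teacher_dists: Sequence[Sequence[float]],
                   student_dists: Sequence[Sequence[float]]) -> float:
    """Mean per-token KL over the explanation tokens (assumed aggregation)."""
    if len(teacher_dists) != len(student_dists) or not teacher_dists:
        raise ValueError("need matching, non-empty per-token distributions")
    kls = [token_kl(p, q) for p, q in zip(teacher_dists, student_dists)]
    return sum(kls) / len(kls)


def rlt_reward(solution_logprobs: Sequence[float],
               teacher_dists: Sequence[Sequence[float]],
               student_dists: Sequence[Sequence[float]],
               kl_weight: float = 0.1) -> float:
    """Hypothetical dense reward: student understanding of the solution
    minus a KL penalty on the explanation tokens."""
    return (solution_score(solution_logprobs)
            - kl_weight * explanation_kl(teacher_dists, student_dists))
```

Under these assumptions, an explanation that raises the student's likelihood of the solution receives a higher reward, while an explanation whose tokens the student finds much less likely than the teacher does is penalized. In practice the log-probabilities and distributions would come from forward passes of the student and teacher models.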
