Reinforcement Learning vs. Distillation: Understanding Accuracy and Capability in LLM Reasoning
Minwu Kim, Anubhav Shrestha, Safal Shrestha, Aadim Nepal, Keith Ross
TL;DR
This work examines how reinforcement learning with verifiable rewards (RLVR) and distillation affect reasoning in LLMs, distinguishing accuracy (pass@1) from capability (pass@k). It shows that RLVR reliably increases accuracy but often fails to improve capability, because gains on easier questions come at the expense of the most difficult ones. In the small-model settings studied, RLVR also produces higher-quality responses that were absent from the original output distribution, and these responses are neither noticeably longer nor richer in reflection-related keywords. Distillation from strong teachers can boost both accuracy and capability, and the authors conjecture that capability improves only when new knowledge is introduced; distilling reasoning patterns alone boosts accuracy without expanding capability, echoing RLVR's pattern. These results clarify when and why RLVR and distillation change model reasoning, offering guidance for designing training regimens that genuinely expand LLM capabilities.
Abstract
Recent studies have shown that reinforcement learning with verifiable rewards (RLVR) enhances overall accuracy (pass@1) but often fails to improve capability (pass@k) of LLMs in reasoning tasks, while distillation can improve both. In this paper, we investigate the mechanisms behind these phenomena. First, we demonstrate that RLVR struggles to improve capability because it focuses on improving accuracy on the easier questions to the detriment of accuracy on the most difficult questions. Second, we show that RLVR does not merely increase the success probability for the easier questions but, in our small-model settings, produces high-quality responses that were absent from the model's original output distribution. In addition, we show that these responses are neither noticeably longer nor richer in reflection-related keywords, underscoring the need for more reliable indicators of response quality. Third, from an experiment distilling teacher responses to in-distribution problems, we find that capability does not always improve with distillation. We conjecture that capability improves only when new knowledge is introduced, whereas distilling reasoning patterns alone improves accuracy but not capability, sacrificing performance on the most difficult questions, similar to RLVR. Together, these findings offer a clearer understanding of how RLVR and distillation shape reasoning behavior in LLMs.
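To make the accuracy/capability distinction concrete, the sketch below computes pass@1 and pass@k from per-question sample counts. It assumes the commonly used unbiased estimator pass@k = 1 - C(n-c, k) / C(n, k), where n responses are sampled per question and c of them are correct; the abstract does not say which estimator the authors use, so treat this as an illustration only. The function names and the per-question counts are hypothetical and are not results from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one question.

    n: responses sampled, c: how many were correct, k: attempt budget (k <= n).
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset has a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average pass@k over questions, given correct-sample counts per question."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical counts (NOT from the paper): 4 questions, n = 16 samples each.
# "base" solves every question at least occasionally; "tuned" always solves
# the two easier questions and never solves the two harder ones.
n = 16
base = [8, 8, 1, 1]
tuned = [16, 16, 0, 0]

for name, counts in (("base", base), ("tuned", tuned)):
    acc = mean_pass_at_k(counts, n, k=1)   # accuracy (pass@1)
    cap = mean_pass_at_k(counts, n, k=16)  # capability (pass@k, k = 16)
    print(f"{name}: pass@1 = {acc:.3f}, pass@16 = {cap:.3f}")
```

In this toy setup the tuned model has higher pass@1 but lower pass@16 than the base model, which is the qualitative pattern the abstract attributes to RLVR: probability mass shifts toward easier questions while the hardest ones stop being solved at all.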
