The Valley of Code Reasoning: Scaling Knowledge Distillation of Large Language Models

Muyu He, Muhammad Ali Shafique, Anand Kumar, Tsach Mackey, Nazneen Rajani

TL;DR

This work investigates how knowledge distillation of code reasoning from LLMs to small models scales with distillation data, revealing a non-monotonic valley where performance dips before improving as data increases. Using two instruction-tuned LLMs and three data regimes, the authors show that easier coding questions yield larger gains than harder ones in low-to-mid data settings, and that the correctness of the teacher's outputs has little effect on distillation outcomes. They pose three research questions about data quantity, output correctness, and problem difficulty, and validate them with LiveCodeBench and OCR2-derived data, reporting consistent valley dynamics and log-linear gains in downstream measures such as LCB scores, completion rates, and think-tag usage. The findings offer practical guidance for data curation in code-reasoning distillation and point to future work exploring larger data regimes beyond common scales. The dataset splits used are open-sourced to support replication and extension.

Abstract

Distilling the thinking traces of a Large Language Model (LLM) with reasoning capabilities into a smaller model has proven effective. Yet little work has examined how model performance scales with the quantity of distillation data. In this work, we study the scaling trend of distilling competitive coding skills into two small non-reasoning LLMs. We validate the hypothesis that there is a "valley of code reasoning": downstream performance on competitive coding first drops as data quantity increases, then rises steadily in a sharper-than-log-linear fashion. Having identified the trend, we further fine-tune the models at two different distillation stages on the same data to ground conclusions about their respective learning phases. We find that, across stages in the low and medium-low data regimes, small models benefit significantly more from easier coding questions than from harder ones. We also find that, surprisingly, the correctness of outputs in the training data makes no difference to distillation outcomes. Our work represents a step forward in understanding the training dynamics of code reasoning distillation beyond intuition.

Paper Structure

This paper contains 9 sections, 1 figure, and 2 tables.

Figures (1)

  • Figure 1: Evaluation scores on LCB for two distilled small models show that performance initially drops by half, then steadily increases in a log-linear trend toward the 30K data upper bound.