Distilling Reasoning into Student LLMs: Local Naturalness for Selecting Teacher Data

Hoang Anh Just, Myeongseob Ko, Ruoxi Jia

TL;DR

The paper tackles the challenge of distilling long-chain reasoning from multiple teacher LLMs into smaller student models by showing that traditional global naturalness metrics fail in mixed-teacher, long-CoT settings. It introduces Local Naturalness, a step-level local log-probability measure that assesses reasoning steps within a short context, enabling effective teacher and response selection and data mixing. Empirical results across math benchmarks and cross-domain tasks show that Local Naturalness consistently improves downstream performance, often surpassing training data from the best single teacher. The approach offers a practical, data-efficient path to enhanced reasoning in student models and highlights the importance of localized data quality over global sequence-level assessments.

Abstract

Distilling long reasoning traces (10K+ tokens) from stronger teacher models into smaller student LLMs via SFT has emerged as a standard paradigm. This approach is practical and efficient: it leverages the ease of generating abundant reasoning data from stronger models and provides a direct, data-driven way to teach less capable models better reasoning. While previous work has largely focused on prompt selection with responses from a single teacher, the equally important problem of choosing the best response when multiple teacher outputs are available for a single prompt remains underexplored. This challenge becomes important in a multi-teacher setting, where different students may benefit from the outputs of different teachers. This paper fills that gap with a systematic study of response selection for reasoning distillation. We first show that the current method, which picks responses the student assigns the highest global log-probability (global naturalness), fails when responses come from multiple teachers, i.e., global naturalness no longer correlates with downstream performance, especially as the reasoning traces from strong teachers become longer. To overcome this problem, we introduce Local Naturalness, which measures the student's log-probabilities over short, sequential reasoning steps conditioned only on a small local window. Local Naturalness enables two applications: 1) Teacher Selection: Aggregating local scores across prompts reliably identifies the most helpful teacher. 2) Response Selection from Multiple Teachers: When mixing answers from many teachers, Local Naturalness boosts a 32B student's accuracy on math benchmarks by 9.4pp over global selection, also surpassing the performance achieved by training on data from the single best teacher. These results highlight the power of localized data quality evaluation and data mixing for more effective reasoning distillation.
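
The contrast between global and local scoring described above can be illustrated with a minimal sketch. It is not the paper's implementation: the `logprob_fn` interface, the per-token normalization, the exclusion of the prompt from the local context, and the `window` parameter (counted in steps) are all assumptions made for illustration. In practice `logprob_fn` would wrap a student LLM and return the summed log-probability of the target tokens given the context tokens.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical interface (an assumption, not from the paper): returns the SUM of
# the student's log-probabilities of `target` tokens given `context` tokens.
LogProbFn = Callable[[Sequence[str], Sequence[str]], float]


def global_score(prompt: Sequence[str], steps: List[List[str]],
                 logprob_fn: LogProbFn) -> float:
    """Mean per-token log-prob; each step sees the prompt and ALL earlier steps."""
    total, n_tokens = 0.0, 0
    context = list(prompt)
    for step in steps:
        total += logprob_fn(context, step)
        n_tokens += len(step)
        context.extend(step)  # the context keeps growing with the trace
    return total / max(n_tokens, 1)


def local_score(steps: List[List[str]], logprob_fn: LogProbFn,
                window: int = 1) -> float:
    """Mean per-token log-prob; each step sees only the previous `window` steps.

    Assumption: the prompt is left out of the local context.
    """
    total, n_tokens = 0.0, 0
    for i, step in enumerate(steps):
        # Flatten the tokens of the last `window` steps into a short context.
        context = [tok for prev in steps[max(0, i - window):i] for tok in prev]
        total += logprob_fn(context, step)
        n_tokens += len(step)
    return total / max(n_tokens, 1)


def select_response(candidates: Dict[str, List[List[str]]],
                    logprob_fn: LogProbFn, window: int = 1) -> str:
    """Return the teacher whose response has the highest local score for one prompt."""
    return max(candidates,
               key=lambda name: local_score(candidates[name], logprob_fn, window))
```

Averaging `local_score` per teacher over all prompts would give a teacher-level score in the spirit of the Teacher Selection application, while calling `select_response` per prompt mixes responses from several teachers.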

Paper Structure

This paper contains 43 sections, 3 equations, 6 figures, 13 tables.

Figures (6)

  • Figure 1: We contrast two data selection metrics: the global log-probability, computed with the entire preceding context, and the local log-probability, computed within a local sliding window. Model performance is reported both before and after SFT when training data are selected by three strategies: Random sampling, Global scoring, and our proposed Local scoring.
  • Figure 2: Average performance vs. global log probabilities (scaled by $10^2$) of the teacher data used for model training, for short data. (Left) Qwen-7B-Instruct as the student. (Right) Qwen-Math-7B as the student.
  • Figure 3: Average performance vs. global and local log probabilities (scaled by $10^2$) of the long reasoning data from three teacher models, used for model training on teacher LIMO data. (Left) Qwen-32B-Instruct as the student. (Right) Qwen-7B-Instruct as the student.
  • Figure 4: Average log probabilities (scaled by $10^2$) with increasing context window, showing convergence to the global log-probability ranking. Qwen-7B-Instruct is the student model, trained with LIMO responses from the teacher (average SFT performance reported).
  • Figure 5: Loss plots of the student model Qwen2.5-32B-Instruct trained on randomly selected data points from LIMO responses, on the responses with the highest local log probabilities, and on the responses with the highest global log probabilities.
  • ...and 1 more figure