Explain in Your Own Words: Improving Reasoning via Token-Selective Dual Knowledge Distillation

Minsang Kim, Seung Jun Baek

Abstract

Knowledge Distillation (KD) can transfer the reasoning abilities of large models to smaller ones, reducing the cost of generating Chain-of-Thought for reasoning tasks. KD methods typically ask the student to mimic the teacher's distribution over the entire output. However, a student with limited capacity can be overwhelmed by such extensive supervision, causing a distribution mismatch, especially in complex reasoning tasks. We propose Token-Selective Dual Knowledge Distillation (TSD-KD), a framework for student-centric distillation. TSD-KD focuses on distilling the tokens that are important for reasoning and encourages the student to explain its reasoning in its own words. TSD-KD combines indirect and direct distillation. Indirect distillation uses a weak form of feedback based on preference ranking: the student proposes candidate responses that it generates on its own, and the teacher re-ranks those candidates as indirect feedback without enforcing its entire distribution. Direct distillation uses distribution matching, but it selectively distills tokens based on the relative confidence between teacher and student. Finally, we add entropy regularization to maintain the student's confidence during distillation. Overall, our method provides the student with targeted and indirect feedback to support its own reasoning process and to facilitate self-improvement. Experiments show that TSD-KD achieves state-of-the-art performance on 10 challenging reasoning benchmarks, outperforming the baseline and the runner-up in accuracy by up to 54.4% and 40.3%, respectively. Notably, a student trained by TSD-KD even outperformed its own teacher model in four cases, by up to 20.3%. The source code is available at https://github.com/kmswin1/TSD-KD.
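
As a rough illustration of the direct-distillation and entropy-regularization ideas described in the abstract, the following sketch selects the positions where the student is uncertain but the teacher is confident, then matches distributions only there. The abstract does not give the exact formulas, so everything specific here is an assumption: the entropy-gap selection score, the forward KL direction, and the names `select_critical`, `tsd_style_loss`, `ratio`, and `lam` are hypothetical and are not taken from the authors' code.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(q * math.log(q) for q in p if q > 0.0)

def kl(p, q, eps=1e-12):
    """KL(p || q) between two distributions over the same vocabulary."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0)

def select_critical(teacher, student, ratio):
    """Pick positions where the student is uncertain but the teacher is confident.

    Assumed score: student entropy minus teacher entropy (larger = more critical).
    """
    gaps = [entropy(s) - entropy(t) for t, s in zip(teacher, student)]
    k = max(1, math.ceil(ratio * len(gaps)))
    ranked = sorted(range(len(gaps)), key=lambda i: gaps[i], reverse=True)
    return sorted(ranked[:k])

def tsd_style_loss(teacher, student, ratio=0.5, lam=0.1):
    """Selective KL on critical tokens plus an entropy penalty on the same tokens."""
    idx = select_critical(teacher, student, ratio)
    kd = sum(kl(teacher[i], student[i]) for i in idx) / len(idx)
    ent = sum(entropy(student[i]) for i in idx) / len(idx)
    return kd + lam * ent, idx

# Toy per-position distributions over a 3-token vocabulary (made-up numbers).
teacher = [[0.90, 0.05, 0.05], [0.34, 0.33, 0.33], [0.98, 0.01, 0.01], [0.5, 0.3, 0.2]]
student = [[0.85, 0.10, 0.05], [0.40, 0.30, 0.30], [0.40, 0.30, 0.30], [0.5, 0.3, 0.2]]

loss, idx = tsd_style_loss(teacher, student)
print(idx, round(loss, 4))  # idx is [0, 2]: the two positions with the largest entropy gap
```

In this toy input, position 1 is skipped because the teacher itself is uncertain there, and position 3 is skipped because student and teacher already agree; only the remaining positions contribute to the loss.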

Paper Structure

This paper contains 29 sections, 1 theorem, 30 equations, 10 figures, 10 tables, 2 algorithms.

Key Result

Proposition 1

Suppose the student's reward for response $y$ given input $x$ is implicitly defined as … Similarly, define the teacher's reward function $r_T(x,y) := \log p_T(y|x)$. Suppose the top-2 tokens at position $t$ are $y_t^{(1)}, y_t^{(2)}$, and the sub-responses up to $t$ are given as in (eq:subresponse). Consider the preference model where the teacher's reward determines the preference label for $y^{(1)}_{\dots}$ … and vice versa …
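
To make the teacher's reward $r_T(x,y) = \log p_T(y|x)$ concrete, the sketch below computes it as a sum of per-token log-probabilities (the standard autoregressive factorization) and compares two sub-responses that differ only in their last token. The proposition is truncated in this excerpt, so the exact preference model is not visible; the hard label and the Bradley-Terry style probability used here are assumptions, and the function names and toy probabilities are hypothetical.

```python
import math

def reward(token_probs):
    """log p(y | x) as the sum of per-token log-probabilities."""
    return sum(math.log(p) for p in token_probs)

def preference(probs_1, probs_2):
    """Compare two sub-responses under the teacher's reward.

    Returns (label, p1_wins): label is 1 or 2 for the preferred sub-response,
    and p1_wins is an assumed Bradley-Terry probability sigmoid(r1 - r2).
    """
    r1, r2 = reward(probs_1), reward(probs_2)
    p1_wins = 1.0 / (1.0 + math.exp(-(r1 - r2)))
    label = 1 if r1 >= r2 else 2
    return label, p1_wins

# Teacher probabilities for two sub-responses that share a prefix and differ
# only in the final token (made-up numbers).
cand_1 = [0.9, 0.8, 0.1]
cand_2 = [0.9, 0.8, 0.7]

label, p1_wins = preference(cand_1, cand_2)
print(label, round(p1_wins, 3))  # prints: 2 0.125
```

Because the two sub-responses share a prefix, the shared terms cancel in $r_1 - r_2$, and the comparison depends only on the teacher's probability of the differing token.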

Figures (10)

  • Figure 1: Average per-token entropy of reasoning traces for yuan2025advancing.
  • Figure 2: Overview of TSD-KD. 1) Indirect distillation. Unlike traditional KD, the student actively suggests reasoning candidates, and the teacher chooses from them. 1.1) The student generates a reasoning response. 1.2) The early part of the response, which contains important branching points of reasoning (called the opener), is selected. 1.3) The student sequentially generates candidate partial responses from the opener. The top candidates are proposed to the teacher; for example, ① "To solve x+3=5, we subtract 5" and ② "To solve x+3=5, we subtract 3". The teacher provides a preference ranking (here preferring ②) that favors better reasoning. The preference signal is used as an indirect form of distillation. 2) Direct distillation selectively distills critical tokens about which the student is uncertain but the teacher is confident. 3) Entropy regularization minimizes the entropy of critical tokens, reducing uncertainty and maintaining the student's confidence during distillation.
  • Figure 3: Performance with varying $c$.
  • Figure 4: Performance with different top-$k$ values for $\mathcal{L}_\text{Indirect}$.
  • Figure 5: Performance with varying token selection ratio for entropy regularization.
  • ...and 5 more figures
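
The candidate-proposal step in the Figure 2 caption (steps 1.2 and 1.3) can be sketched roughly as follows, using the caption's own "subtract 5" versus "subtract 3" example. This is a simplified illustration under stated assumptions: it branches at a single position of the opener rather than sequentially, the student and teacher probabilities are made-up numbers, and `propose_candidates`, `rank_by_teacher`, and the lookup-table teacher are hypothetical stand-ins for real model calls.

```python
import math

def propose_candidates(opener, next_token_probs, k=2):
    """Extend the opener with each of the student's top-k next tokens."""
    top = sorted(next_token_probs, key=next_token_probs.get, reverse=True)[:k]
    return [opener + [tok] for tok in top]

def rank_by_teacher(candidates, teacher_score):
    """Order candidates from most to least preferred under the teacher's score."""
    return sorted(candidates, key=teacher_score, reverse=True)

opener = "To solve x+3=5, we subtract".split()

# Student's next-token distribution after the opener (made-up numbers).
student_next = {"5": 0.5, "3": 0.4, "x": 0.1}

# Stand-in for the teacher: log-probability of the final token only.
teacher_logp = {"5": math.log(0.05), "3": math.log(0.9), "x": math.log(0.05)}

candidates = propose_candidates(opener, student_next, k=2)
ranked = rank_by_teacher(candidates, lambda c: teacher_logp[c[-1]])
preferred, rejected = ranked[0], ranked[-1]

print(" ".join(preferred))  # To solve x+3=5, we subtract 3
```

The resulting (preferred, rejected) pair is the kind of signal a preference-based loss would consume; the student's own top choice ("5") is rejected here, while its second choice ("3") is preferred by the teacher.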

Theorems & Definitions (1)

  • Proposition 1