
Student-in-the-Loop Chain-of-Thought Distillation via Generation-Time Selection

Chaoqun He, Yingfa Chen, Chaojun Xiao, Xu Han, Lijie Wen

Abstract

Large reasoning models achieve strong performance on complex tasks through long chain-of-thought (CoT) trajectories, but directly transferring such reasoning processes to smaller models remains challenging. A key difficulty is that not all teacher-generated reasoning trajectories are suitable for student learning. Existing approaches typically rely on post-hoc filtering, selecting trajectories after full generation based on heuristic criteria. However, such methods cannot control the generation process itself and may still produce reasoning paths that lie outside the student's learning capacity. To address this limitation, we propose Gen-SSD (Generation-time Self-Selection Distillation), a student-in-the-loop framework that performs generation-time selection. Instead of passively consuming complete trajectories, the student evaluates candidate continuations during the teacher's sampling process, guiding the expansion of only learnable reasoning paths and enabling early pruning of unhelpful branches. Experiments on mathematical reasoning benchmarks demonstrate that Gen-SSD consistently outperforms standard knowledge distillation and recent baselines, with improvements of around 5.9 points over Standard KD and up to 4.7 points over other baselines. Further analysis shows that Gen-SSD produces more stable and learnable reasoning trajectories, highlighting the importance of incorporating supervision during generation for effective distillation.

Figures (6)

  • Figure 1: Comparison between standard KD and our proposed Gen-SSD.
  • Figure 2: Overview of Gen-SSD. The student actively participates in the teacher's multi-sample generation process. At each chunk, the student evaluates candidate continuations with PPL and selects the fragments best aligned with its capability, thereby influencing the teacher's sampling trajectory. For unsuitable candidates, generation is terminated early, which reduces inference cost and improves sampling efficiency.
  • Figure 3: Average performance of Gen-SSD across benchmarks under different chunk sizes. Detailed results are provided in Table \ref{tab:benchmark_exp} in the appendix.
  • Figure 4: Performance improvements of Gen-SSD over Standard KD across different teacher models, where the y-axis represents the average gain on the tasks in Table \ref{tab:main_exp}.
  • Figure 5: Average PPL of training data under different generation methods. S-D: Self-Distillation; N-S: No Selection; middle values: Gen-SSD with different chunk sizes.
  • ...and 1 more figure
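As described in the abstract and the Figure 2 caption, the core selection step scores each candidate continuation by the student's perplexity (PPL) and either keeps the most learnable chunk or prunes the branch early. The paper does not give pseudocode in this excerpt, so the following is a minimal sketch under assumed details: the helper names (`chunk_ppl`, `select_chunk`), the fixed PPL threshold, and the use of precomputed student token log-probabilities are all illustrative, not the authors' implementation.

```python
import math

def chunk_ppl(token_logprobs):
    # Perplexity of a chunk under the student model:
    # exp of the mean negative token log-probability.
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def select_chunk(candidates, ppl_threshold):
    """Pick the candidate continuation the student finds most learnable.

    candidates: list of (chunk_text, student_token_logprobs) pairs,
    where the log-probabilities come from scoring each teacher-sampled
    chunk with the student model.

    Returns (chunk_text, ppl) for the lowest-PPL candidate, or None if
    every candidate exceeds the threshold -- in which case the trajectory
    is pruned early instead of being expanded further.
    """
    scored = [(text, chunk_ppl(lps)) for text, lps in candidates]
    best = min(scored, key=lambda pair: pair[1])
    return best if best[1] <= ppl_threshold else None

if __name__ == "__main__":
    # Toy example: one easy chunk, one far outside the student's capacity.
    cands = [
        ("factor the quadratic ...", [-0.2, -0.3, -0.25]),  # low PPL
        ("apply obscure lemma ...", [-2.0, -1.8, -2.2]),    # high PPL
    ]
    print(select_chunk(cands, ppl_threshold=2.0))
```

At each chunk boundary the teacher's multi-sample generation would call `select_chunk` and continue decoding only from the selected fragment, which matches the early-termination behavior the Figure 2 caption describes.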