
Learning from Partial Chain-of-Thought via Truncated-Reasoning Self-Distillation

Gianluigi Silvestri, Edoardo Cetin

Abstract

Reasoning-oriented language models achieve strong performance by generating long chain-of-thought traces at inference time. However, this capability comes with substantial and often excessive computational cost, which can manifest as redundant or inefficient reasoning. We study this setting and introduce Truncated-Reasoning Self-Distillation (TRSD), a lightweight post-training procedure that encourages models to produce correct predictions from partial reasoning traces. In TRSD, a frozen teacher model first generates a full reasoning trace and evaluates the corresponding answer distribution conditioned on the prompt and the complete reasoning to construct a synthetic training target. A student model with the same architecture is then trained to match the teacher's answer distribution while being conditioned only on a truncated prefix of the teacher's reasoning trace. Across multiple reasoning benchmarks and token budgets, we demonstrate that TRSD improves robustness to truncated inference, with much smaller accuracy trade-offs across a diverse set of reasoning models. Moreover, although never explicitly regularized for shorter generation during training, we also find that TRSD-trained models inherently output shorter reasoning traces without truncation, significantly reducing inference-time costs even without artificial interventions.

Paper Structure

This paper contains 33 sections, 1 equation, 10 figures, 7 tables, 1 algorithm.

Figures (10)

  • Figure 1: Accuracy as a function of the available reasoning budget for a Qwen3-4B model on GSM8K. Truncated-Reasoning Self-Distillation (TRSD) substantially improves performance in low-budget regimes, enabling accurate predictions with limited reasoning.
  • Figure 2: Truncated-Reasoning Self-Distillation (TRSD). Given an input prompt $x$, a frozen teacher model first generates a full chain-of-thought reasoning trace $r$ and an answer $y$, and then evaluates the answer-token distribution $p_{\text{teacher}}(y\mid x,r)$ conditioned on the prompt and the complete reasoning trace. The trainable student model, initialized as a copy of the teacher, is conditioned only on a truncated prefix $\bar{r}$ of the teacher-generated reasoning trace, and evaluates the corresponding answer distribution $p_{\text{student}}(y\mid x,\bar{r})$. Training minimizes the KL divergence between the teacher and student answer distributions, encouraging the student to recover the same predictions from partial reasoning and to remain accurate when inference-time reasoning is truncated.
  • Figure 3: Per-dataset accuracy as a function of the reasoning budget for Qwen3-4B. The evaluation dataset is specified below the respective plot.
  • Figure 4: Per-dataset accuracy as a function of the reasoning budget for Phi-4-mini-reasoning. The evaluation dataset is specified below the respective plot.
  • Figure 5: Example where both models answer correctly, but the TRSD-trained model uses a substantially shorter reasoning trace. The example is taken verbatim from the Qwen3-4B GSM8K evaluation set.
  • ...and 5 more figures
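
The training objective described in the Figure 2 caption can be illustrated with a small sketch. This is not the authors' implementation: the function names (`truncate_prefix`, `trsd_loss`), the direction of the divergence (teacher relative to student), the averaging over answer positions, and the toy logits are all assumptions made for illustration. The caption states only that training minimizes a KL divergence between the teacher's answer distribution, conditioned on the full trace $r$, and the student's, conditioned on the truncated prefix $\bar{r}$.

```python
import math

def truncate_prefix(reasoning_tokens, budget):
    """Keep only the first `budget` tokens of the teacher's reasoning trace."""
    return reasoning_tokens[:budget]

def softmax(logits):
    """Numerically stable softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * (math.log(pi + eps) - math.log(qi + eps))
               for pi, qi in zip(p, q))

def trsd_loss(teacher_logits, student_logits):
    """Mean per-position KL(teacher || student) over the answer tokens.

    teacher_logits: answer-token logits given the prompt and the FULL trace.
    student_logits: answer-token logits given the prompt and a TRUNCATED prefix.
    The KL direction and the mean reduction are assumptions of this sketch.
    """
    per_token = [kl_divergence(softmax(t), softmax(s))
                 for t, s in zip(teacher_logits, student_logits)]
    return sum(per_token) / len(per_token)

# Toy example: a 10-token reasoning trace truncated to a 4-token budget.
full_trace = list(range(10))
prefix = truncate_prefix(full_trace, budget=4)

# Hypothetical logits for a 2-token answer over a 3-word vocabulary.
teacher_logits = [[2.0, 0.5, -1.0], [0.1, 3.0, 0.2]]
student_logits = [[1.0, 1.0, 0.0], [0.5, 1.5, 0.5]]

loss_mismatch = trsd_loss(teacher_logits, student_logits)
loss_match = trsd_loss(teacher_logits, teacher_logits)
```

In this toy setup the loss is zero when the student reproduces the teacher's answer distribution exactly and positive otherwise. In an actual training loop the logits would come from forward passes of the frozen teacher and the trainable student, with gradients flowing only through the student.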