Unveiling the Key Factors for Distilling Chain-of-Thought Reasoning

Xinghao Chen, Zhijing Sun, Wenjin Guo, Miaoran Zhang, Yanjun Chen, Yirong Sun, Hui Su, Yijie Pan, Dietrich Klakow, Wenjie Li, Xiaoyu Shen

TL;DR

The paper studies how to distill chain-of-thought (CoT) reasoning from powerful teacher models into Small Language Models (SLMs) by jointly considering three factors: the choice of teacher, the granularity of reasoning, and the format of presentation. Through experiments with four teacher models and seven student models across seven math and commonsense datasets, it uncovers a non-monotonic relation between granularity and SLM performance, a limited effect of CoT format on SLMs, and that stronger teachers do not always yield stronger students. It further reveals a task- and model-dependent Matthew effect, where stronger students gain more from reasoning supervision while weaker students struggle with overly detailed CoTs. These findings provide actionable guidance for efficient CoT distillation in resource-constrained settings and point to adaptive, curriculum-like approaches.

Abstract

Large Language Models (LLMs) excel in reasoning tasks through Chain-of-Thought (CoT) prompting. However, CoT prompting greatly increases computational demands, which has prompted growing interest in distilling CoT capabilities into Small Language Models (SLMs). This study systematically examines the factors influencing CoT distillation, including the choice of granularity, format, and teacher model. Through experiments involving four teacher models and seven student models across seven mathematical and commonsense reasoning datasets, we uncover three key findings: (1) Unlike LLMs, SLMs exhibit a non-monotonic relationship with granularity, with stronger models benefiting from finer-grained reasoning and weaker models performing better with simpler CoT supervision; (2) CoT format significantly impacts LLMs but has minimal effect on SLMs, likely due to their reliance on supervised fine-tuning rather than pretraining preferences; (3) Stronger teacher models do NOT always produce better student models, as diversity and complexity in CoT supervision can outweigh accuracy alone. These findings emphasize the need to tailor CoT strategies to the specific student model, offering actionable insights for optimizing CoT distillation in SLMs. The code and datasets are available at https://github.com/EIT-NLP/Distilling-CoT-Reasoning.

Paper Structure

This paper contains 45 sections, 14 equations, 16 figures, 10 tables, and 1 algorithm.

Figures (16)

  • Figure 1: Overview of CoT Distillation. Different teacher models generate CoT supervision with varying levels of granularity and formats to fine-tune the student model.
  • Figure 2: Performance of student models with different granularity. Most models achieve peak accuracy at intermediate granularity levels.
  • Figure 3: Scatter plots of teacher model (GPT-4o, x-axis) vs. student accuracy (y-axis) across datasets and granularity levels. Each point marker represents a specific dataset.
  • Figure 4: Scatter plots of teacher (x-axis) vs. student model accuracy (y-axis) across datasets. GPT refers to GPT-4o, LLaMA refers to LLaMA 3 70B, and Gemini refers to Gemini-1.5-Flash.
  • Figure 5: Student model performance across different teacher models. Each bar represents the average accuracy of a specific student model trained on CoT from different teacher models.
  • ...and 11 more figures