Effectiveness of Chain-of-Thought in Distilling Reasoning Capability from Large Language Models

Cong-Thanh Do, Rama Doddipatla, Kate Knill

TL;DR

This work investigates whether introducing Chain-of-Thought (CoT) rationales into white-box knowledge distillation (KD) improves the transfer of reasoning from a large teacher LLM to a smaller student LLM. Using models from the Qwen and Llama2 families, it trains on rationales from the CoT-Collection dataset and evaluates on the BIG-Bench-Hard (BBH) benchmark with few-shot CoT prompts at inference. The KD+CoT approach yields consistent, though not universal, gains in average BBH performance over vanilla white-box KD, highlighting its potential to boost multi-step reasoning and language understanding in compact models. The findings support deploying smaller, more efficient LLMs with enhanced reasoning capabilities while keeping inference latency lower than that of larger models.

Abstract

Chain-of-Thought (CoT) prompting is a widely used method to improve the reasoning capability of Large Language Models (LLMs). More recently, CoT has been leveraged in Knowledge Distillation (KD) to transfer reasoning capability from a larger LLM to a smaller one. This paper examines the role of CoT in distilling the reasoning capability from larger LLMs to smaller LLMs using white-box KD, analysing its effectiveness in improving the performance of the distilled models for various natural language reasoning and understanding tasks. We conduct white-box KD experiments using LLMs from the Qwen and Llama2 families, employing CoT data from the CoT-Collection dataset. The distilled models are then evaluated on natural language reasoning and understanding tasks from the BIG-Bench-Hard (BBH) benchmark, which presents complex challenges for smaller LLMs. Experimental results demonstrate the role of CoT in improving white-box KD effectiveness, enabling the distilled models to achieve better average performance in natural language reasoning and understanding tasks from BBH.

Paper Structure

This paper contains 19 sections, 2 figures, 3 tables.

Figures (2)

  • Figure 1: KD+CoT approach for distilling reasoning capability from a teacher LLM to a student LLM using white-box KD and CoT. Rationales (the intermediate reasoning steps in CoT) are incorporated in the training data of KD+CoT but not in the training data of vanilla white-box KD.
  • Figure 2: Examples of training data instances with integrated rationales that lead to the answers. These instances are from the CoT-Collection dataset (Kim et al., 2023).
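
The contrast drawn in the two figure captions can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the helper names (build_target, token_kl, sequence_kd_loss), the "Therefore, the answer is" template, and the choice of a forward KL divergence averaged over token positions are all hypothetical, since this section does not specify the exact training format or distillation objective. The only points taken from the text are that KD+CoT targets include a rationale while vanilla KD targets do not, and that white-box KD gives the student access to the teacher's output distributions.

```python
import math


def build_target(answer, rationale=None):
    """Return the text the student is trained to produce.

    Vanilla KD (rationale is None): the target is just the answer.
    KD+CoT: the rationale precedes the answer (hypothetical template).
    """
    if rationale is None:
        return answer
    return f"{rationale} Therefore, the answer is {answer}."


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_kl(teacher_logits, student_logits):
    """Forward KL(teacher || student) at one token position (assumed objective)."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def sequence_kd_loss(teacher_seq, student_seq):
    """Average the per-token KL over all positions of the target sequence."""
    losses = [token_kl(t, s) for t, s in zip(teacher_seq, student_seq)]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    print(build_target("4"))
    print(build_target("4", rationale="2 plus 2 equals 4."))
    # Toy logits over a 3-token vocabulary at two target positions.
    teacher = [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3]]
    student = [[1.0, 0.8, 0.2], [0.1, 0.9, 0.7]]
    print(sequence_kd_loss(teacher, student))
```

In this sketch the loss function is identical for both settings; only the target sequence changes. With a rationale in the target, the student is matched to the teacher's distribution over every reasoning token as well as the answer tokens, which is one plausible reading of how rationales could carry extra signal in white-box KD.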