Effectiveness of Chain-of-Thought in Distilling Reasoning Capability from Large Language Models

Cong-Thanh Do; Rama Doddipatla; Kate Knill

TL;DR

This work investigates whether adding Chain-of-Thought (CoT) rationales to white-box knowledge distillation (KD) improves the transfer of reasoning from a large teacher LLM to a smaller student LLM. Using models from the Qwen and Llama2 families, the authors train on rationales from the CoT-Collection dataset and evaluate on the BIG-Bench-Hard (BBH) benchmark with few-shot CoT prompts at inference. The KD+CoT approach yields fairly consistent, though not universal, gains in average BBH performance over vanilla white-box KD, indicating that it can strengthen multi-step reasoning and language understanding in compact models. The findings support deploying smaller LLMs with improved reasoning capability while retaining their inference-latency advantage over larger models.

Abstract

Chain-of-Thought (CoT) prompting is a widely used method to improve the reasoning capability of Large Language Models (LLMs). More recently, CoT has been leveraged in Knowledge Distillation (KD) to transfer reasoning capability from a larger LLM to a smaller one. This paper examines the role of CoT in distilling the reasoning capability from larger LLMs to smaller LLMs using white-box KD, analysing its effectiveness in improving the performance of the distilled models for various natural language reasoning and understanding tasks. We conduct white-box KD experiments using LLMs from the Qwen and Llama2 families, employing CoT data from the CoT-Collection dataset. The distilled models are then evaluated on natural language reasoning and understanding tasks from the BIG-Bench-Hard (BBH) benchmark, which presents complex challenges for smaller LLMs. Experimental results demonstrate the role of CoT in improving white-box KD effectiveness, enabling the distilled models to achieve better average performance in natural language reasoning and understanding tasks from BBH.
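
To make the term concrete: white-box KD means the student is trained against the teacher's token-level output distributions rather than only its generated text. The sketch below is illustrative and is not taken from the paper. It assumes a forward KL divergence between teacher and student distributions, averaged over target positions, which in a KD+CoT setup would plausibly cover the rationale and the final answer. The function names, the `target_mask` argument, and the temperature parameter are all hypothetical; the paper's actual objective may differ.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over one token position's logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    lse = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - lse for x in scaled]


def token_kl(teacher_logits, student_logits, temperature=1.0):
    """Forward KL(teacher || student) at a single token position."""
    log_p = log_softmax(teacher_logits, temperature)
    log_q = log_softmax(student_logits, temperature)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def kd_loss(teacher_seq, student_seq, target_mask, temperature=1.0):
    """Mean token-level KL over positions where target_mask is 1.

    Assumption (not from the paper): target_mask marks the tokens the
    student should imitate, e.g. CoT rationale plus answer, and excludes
    the prompt tokens.
    """
    losses = [
        token_kl(t, s, temperature)
        for t, s, m in zip(teacher_seq, student_seq, target_mask)
        if m
    ]
    return sum(losses) / len(losses) if losses else 0.0


# Toy example: two token positions, vocabulary of three tokens.
teacher = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
student = [[0.0, 0.0, 0.0], [0.1, 0.2, 3.0]]
loss_all = kd_loss(teacher, student, [1, 1])      # both positions count
loss_masked = kd_loss(teacher, student, [0, 1])   # first position ignored
```

In this toy case the student matches the teacher at the second position only, so masking out the first position drives the loss to zero, while including it gives a positive value.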
