Learning to Maximize Mutual Information for Chain-of-Thought Distillation

Xin Chen; Hanxian Huang; Yanjun Gao; Yi Wang; Jishen Zhao; Ke Ding

TL;DR

This work reframes chain-of-thought (CoT) distillation as an information bottleneck problem, introducing a variational mutual-information loss that couples the label-prediction and rationale-generation representations. By maximizing the mutual information between the task representations $Z$ and $V$ while controlling information flow, the approach aligns CoT reasoning with the predictive objective, improving knowledge transfer from large language models to smaller students. Empirical results on four datasets with T5-base and T5-small show consistent gains over standard finetuning, single-task training, and the prior DSS method, along with calibrated predictions and higher-quality CoT outputs as assessed by GPT-4. The findings offer a principled direction for CoT-based distillation and suggest future integration of information-theoretic objectives in language-model compression.

Abstract

Knowledge distillation, the technique of transferring knowledge from large, complex models to smaller ones, marks a pivotal step towards efficient AI deployment. Distilling Step-by-Step (DSS), a novel method utilizing chain-of-thought (CoT) distillation, has demonstrated promise by imbuing smaller models with the superior reasoning capabilities of their larger counterparts. In DSS, the distilled model acquires the ability to generate rationales and predict labels concurrently through a multi-task learning framework. However, DSS overlooks the intrinsic relationship between the two training tasks, leading to ineffective integration of CoT knowledge with the task of label prediction. To this end, we investigate the mutual relationship of the two tasks from an Information Bottleneck perspective and formulate it as maximizing the mutual information of the representation features of the two tasks. We propose a variational approach to solve this optimization problem using a learning-based method. Our experimental results across four datasets demonstrate that our method outperforms the state-of-the-art DSS. Our findings offer insightful guidance for future research on language model distillation as well as applications involving CoT. Code is available at https://github.com/xinchen9/cot_distillation_ACL2024.
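To make the idea of "maximizing the mutual information of the representation features of the two tasks" more concrete, the following is a minimal sketch, not the paper's implementation. It assumes the two task representations are available as paired rows of two arrays (hypothetically named `z` for label prediction and `v` for rationale generation) and substitutes the well-known InfoNCE estimator as the variational lower bound; the bound actually derived in the paper may differ. The function names and the `temperature` parameter are illustrative assumptions.

```python
import numpy as np


def info_nce_bound(z, v, temperature=0.1):
    """InfoNCE lower-bound estimate of I(Z; V) from N paired rows.

    Assumption: row i of `z` and row i of `v` come from the same input.
    This is a stand-in variational bound, not the paper's own derivation.
    """
    # Cosine similarity between every z_i and every v_j.
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    logits = (z @ v.T) / temperature  # shape (N, N)

    # Numerically stable row-wise log-softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # The bound is log N plus the mean log-probability of the true pairs.
    n = z.shape[0]
    return np.log(n) + np.mean(np.diag(log_probs))


def mi_loss(z, v, temperature=0.1):
    """Auxiliary loss: minimizing it maximizes the InfoNCE bound."""
    return -info_nce_bound(z, v, temperature)
```

In a multi-task setup of the kind described above, a term like `mi_loss(z, v)` would be added, with some weight, to the prediction and explanation losses. The estimate can never exceed `log N` for a batch of `N` pairs, which is a known limitation of InfoNCE.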

Paper Structure (22 sections, 11 equations, 5 figures, 8 tables)

Introduction
Related Work
Methodology
Preliminaries
Variational Bounds of MI
Training Loss
Experiments
Experimental Setting
Datasets.
Setup.
Baselines.
Evaluation Settings.
Results
Ablation Study
Effectiveness of Difference Dimension Reduction Method
...and 7 more sections

Figures (5)

Figure 1: Overview of our approach: distilling step by step from an IB perspective and measuring the intrinsic relationship between the two tasks by MI. DSS is an MTL framework consisting of a prediction task and an explanation task. Besides the prediction and explanation losses used in DSS, we design an auxiliary loss, named MI loss, which aims to maximize the mutual information between the two representation features. In turn, CoT reasoning capacity can be enhanced by task-oriented distillation.
Figure 2: Comparison with DSS with varying sizes of training datasets on T5-base model.
Figure 3: Comparison with DSS with varying sizes of training datasets on T5-small model.
Figure 4: A case study of the output rationale on SVAMP dataset.
Figure 5: A case study of the output rationale on e-SNLI dataset.
