Efficient Knowledge Distillation: Empowering Small Language Models with Teacher Model Insights

Mohamad Ballout, Ulf Krumnack, Gunther Heidemann, Kai-Uwe Kühnberger

TL;DR

The paper tackles the challenge of deploying small language models by introducing a gradient-based knowledge distillation method in which a ~3B-parameter teacher identifies the most influential input tokens via saliency. The top-scoring tokens serve as rationales: the student is trained with a dual objective, predicting the label and generating the rationales, in a lightweight setup with a fixed rationale weight. Experiments on four diverse datasets show improvements over standard fine-tuning and competitive results against the step-by-step distillation approach, with a sweet spot of around five rationale tokens. The work also analyzes the efficacy of the token selections, the similarity between models, and the limitations imposed by the remaining performance gap between teacher and student, highlighting practical implications for deploying efficient LMs on resource-constrained platforms.

Abstract

Enhancing small language models for real-life application deployment is a significant challenge facing the research community. Due to the difficulties and costs of using large language models, researchers are seeking ways to effectively deploy task-specific small models. In this work, we introduce a simple yet effective knowledge distillation method to improve the performance of small language models. Our approach utilizes a teacher model with approximately 3 billion parameters to identify the most influential tokens in its decision-making process. These tokens are extracted from the input based on their attribution scores relative to the output, using methods like saliency maps. These important tokens are then provided as rationales to a student model, aiming to distill the knowledge of the teacher model. This method has proven to be effective, as demonstrated by testing it on four diverse datasets, where it shows improvement over both standard fine-tuning methods and state-of-the-art knowledge distillation models. Furthermore, we explore explanations of the success of the model by analyzing the important tokens extracted from the teacher model. Our findings reveal that in 68% of cases, specifically in datasets where labels are part of the answer, such as multiple-choice questions, the extracted tokens are part of the ground truth.
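The token-extraction step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a toy "teacher" (mean-pooled token embeddings followed by a linear classifier) and uses gradient-times-input as the attribution score, which is one common saliency variant. The function name `top_k_salient_tokens`, the embeddings, and the weights are all hypothetical.

```python
import numpy as np

def top_k_salient_tokens(tokens, embeddings, weights, k=5):
    """Return the k tokens with the highest gradient-x-input scores, in input order.

    Hypothetical helper: a toy stand-in for a real teacher model.
    """
    # Toy teacher: mean-pool the token embeddings, then apply a linear classifier.
    pooled = embeddings.mean(axis=0)          # shape (d,)
    logits = weights @ pooled                 # shape (num_classes,)
    predicted = int(np.argmax(logits))

    # For this linear model, the gradient of the predicted logit with respect
    # to each token embedding is weights[predicted] / n (same for every token).
    grad = weights[predicted] / len(tokens)   # shape (d,)

    # Gradient x input: one scalar attribution score per token.
    scores = np.abs(embeddings @ grad)        # shape (n,)

    # Pick the k highest-scoring positions, then restore the original order.
    k = min(k, len(tokens))
    top = np.argsort(-scores, kind="stable")[:k]
    keep = sorted(top.tolist())
    return [tokens[i] for i in keep], scores


# Made-up inputs for illustration only.
tokens = ["the", "movie", "was", "truly", "awful"]
embeddings = np.array([[0.1, 0.0], [0.2, 0.1], [0.0, 0.1], [0.1, 0.9], [0.0, 2.0]])
weights = np.array([[1.0, 0.0], [0.0, 1.0]])  # two classes, d = 2

rationale, scores = top_k_salient_tokens(tokens, embeddings, weights, k=2)
print(rationale)
```

With a real teacher, the gradient would come from backpropagation through the full network instead of the closed form used here; the top-k selection step would stay the same.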

Paper Structure (17 sections, 6 equations, 3 figures, 6 tables)

Figures (3)

  • Figure 1: The figure displays the performance of our proposed method across four diverse datasets: e-SNLI and ANLI, which involve natural language inference; CQA, which focuses on commonsense question answering; and SVAMP, which deals with arithmetic word problems. We compare our method to both the standard fine-tuning method and the step-by-step distillation method.
  • Figure 2: The figure illustrates the pipeline of our proposed method, where tokens with the highest attribution scores, extracted from the teacher model, are used as rationales. The student model is then trained to generate these rationales, along with the original labels.
  • Figure 3: The figure shows the performance of the model versus the number of tokens provided as rationales.
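
The training pipeline in Figure 2, in which the student generates both the label and the rationale tokens, together with the fixed rationale weight mentioned in the TL;DR, suggests a weighted two-term loss. The form below is a sketch under that reading. The symbols $\mathcal{L}_{\text{label}}$, $\mathcal{L}_{\text{rationale}}$, and $\lambda$ are assumed notation, not taken from the paper.

```latex
% Assumed notation: each term is a loss on the student's generated output,
% and \lambda is the fixed rationale weight.
\mathcal{L} \;=\; \mathcal{L}_{\text{label}} \;+\; \lambda \, \mathcal{L}_{\text{rationale}}
```

Under this reading, setting $\lambda = 0$ recovers standard fine-tuning on labels alone.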