PromptKD: Distilling Student-Friendly Knowledge for Generative Language Models via Prompt Tuning

Gyeongman Kim, Doohyuk Jang, Eunho Yang

Abstract

Recent advancements in large language models (LLMs) have raised concerns about inference costs, increasing the need for research into model compression. While knowledge distillation (KD) is a prominent method for this, research on KD for generative language models like LLMs is relatively sparse, and the approach of distilling student-friendly knowledge, which has shown promising performance in KD for classification models, remains unexplored in generative language models. To explore this approach, we propose PromptKD, a simple yet effective method that utilizes prompt tuning - for the first time in KD - to enable generative language models to transfer student-friendly knowledge. Unlike previous works in classification that require fine-tuning the entire teacher model for extracting student-friendly knowledge, PromptKD achieves similar effects by adding a small number of prompt tokens and tuning only the prompt with student guidance. Extensive experiments on instruction-following datasets show that PromptKD achieves state-of-the-art performance while adding only 0.0007% of the teacher's parameters as prompts. Further analysis suggests that distilling student-friendly knowledge alleviates exposure bias effectively throughout the entire training process, leading to performance enhancements.

Paper Structure

This paper contains 29 sections, 3 equations, 4 figures, 12 tables, 1 algorithm.

Figures (4)

  • Figure 1: Comparison of the instruction-following performance of KD methods using the GPT-2 model family. Owing to student-friendly knowledge, PromptKD outperforms the other methods with only an additional 11K parameters. The dashed reference line represents the performance of the teacher model.
  • Figure 2: Training procedure of PromptKD. To mitigate exposure bias, the student generates responses that serve as pseudo-targets. Then, for adaptive teaching, the prompt fed to the teacher is trained with guidance from the student; a regularization loss is also applied during this step to address instability stemming from the prompt. Lastly, the teacher distills student-friendly knowledge to the student using the trained prompt.
  • Figure 3: Measurement of exposure bias. Excess accumulated error (ExAccErr) is measured with respect to generation steps and training progress, where values closer to 0 indicate alleviation of exposure bias.
  • Figure 4: Ablation on prompt settings. To validate the impact of the prompt initialization method and prompt length, we report the average ROUGE-L score across these settings.
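
The two size figures quoted above (about 11K added parameters in Figure 1, and 0.0007% of the teacher's parameters in the abstract) can be related with a quick back-of-the-envelope check. The sketch below is illustrative only: this page does not state the teacher size, the hidden size, or the prompt length, so the values used here (a GPT-2 XL-sized teacher with roughly 1.5B parameters and hidden size 1600, and a hypothetical prompt of 7 tokens) are assumptions, and `prompt_param_share` is a hypothetical helper name.

```python
def prompt_param_share(prompt_len: int, hidden_size: int, teacher_params: int) -> float:
    """Return the soft prompt's size as a percentage of the teacher's parameters.

    A soft prompt is one trainable vector of width `hidden_size` per prompt token.
    """
    prompt_params = prompt_len * hidden_size
    return 100.0 * prompt_params / teacher_params


# All three values are assumptions, not taken from this page:
# a GPT-2 XL-sized teacher (~1.5B parameters, hidden size 1600)
# and a hypothetical prompt length of 7 tokens.
PROMPT_LEN = 7
HIDDEN_SIZE = 1600
TEACHER_PARAMS = 1_500_000_000

prompt_params = PROMPT_LEN * HIDDEN_SIZE
share = prompt_param_share(PROMPT_LEN, HIDDEN_SIZE, TEACHER_PARAMS)

print(f"{prompt_params} prompt parameters, {share:.4f}% of the teacher")
# prints "11200 prompt parameters, 0.0007% of the teacher"
```

Under these assumed values the two quoted figures are mutually consistent; a different teacher size or prompt length would change the result proportionally.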