PANDA: Prompt Transfer Meets Knowledge Distillation for Efficient Model Adaptation

Qihuang Zhong; Liang Ding; Juhua Liu; Bo Du; Dacheng Tao

TL;DR

This work addresses the limitations of vanilla prompt transfer (PoT) for prompt-tuning by introducing PANDA, a PoT framework enhanced with knowledge distillation (KD) and guided by a novel prompt transferability metric. The metric maps source and target tasks into a shared semantic space using soft prompts and computes transferability as the cosine similarity of the task embeddings, enabling adaptive, distance-aware knowledge transfer. PANDA uses a target-like teacher, formed by applying the source prompt to the target task, and a student with a randomly initialized prompt; the student is trained with both ground-truth supervision and distillation, with the distillation term weighted by the predicted transferability. The framework also extends to multi-task transfer with early- and late-fusion strategies. Across 189 source-target pairs and 5 PLM scales, PANDA consistently outperforms vanilla PoT by $2.3\%$ on average and by up to $24.1\%$ on some tasks, and prompt-tuning with PANDA achieves results competitive with or superior to full model-tuning in several regimes. The approach applies to various PLMs and multi-task settings, and offers a principled direction for more robust, efficient transfer in NLP models.
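As a rough illustration of the transferability metric summarized above, the sketch below scores a source-target pair by the cosine similarity of two task embeddings. It is not the authors' implementation: the function names are hypothetical, and mean-pooling the soft-prompt token vectors into a task embedding is an assumption made here only to keep the example self-contained.

```python
import math
from typing import List, Sequence

Vector = List[float]


def task_embedding(prompt_vectors: Sequence[Sequence[float]]) -> Vector:
    """Collapse a soft prompt (one vector per prompt token) into one vector.

    Assumption: mean-pooling over prompt tokens. The paper may derive its
    task embeddings differently; this is only a stand-in.
    """
    n_tokens = len(prompt_vectors)
    dim = len(prompt_vectors[0])
    return [sum(vec[i] for vec in prompt_vectors) / n_tokens for i in range(dim)]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def transferability(source_prompt: Sequence[Sequence[float]],
                    target_prompt: Sequence[Sequence[float]]) -> float:
    """Hypothetical transferability score for one source-target pair."""
    return cosine_similarity(task_embedding(source_prompt),
                             task_embedding(target_prompt))
```

Under these assumptions, a score near 1 would mark a source task whose prompt lies close to the target task in the shared space, and a score near 0 an unrelated one.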

Abstract

Prompt Transfer (PoT) is a recently proposed approach to improve prompt-tuning by initializing the target prompt with an existing prompt trained on similar source tasks. However, such a vanilla PoT approach usually achieves sub-optimal performance, as (i) PoT is sensitive to the similarity of the source-target pair and (ii) directly fine-tuning a prompt initialized with the source prompt on the target task might lead to forgetting of the useful general knowledge learned from the source task. To tackle these issues, we propose a new metric to accurately predict the prompt transferability (regarding (i)), and a novel PoT approach (namely PANDA) that leverages the knowledge distillation technique to effectively alleviate the knowledge forgetting (regarding (ii)). Extensive and systematic experiments on 189 combinations of 21 source and 9 target datasets across 5 scales of PLMs demonstrate that: 1) our proposed metric works well to predict the prompt transferability; 2) our PANDA consistently outperforms the vanilla PoT approach by 2.3% average score (up to 24.1%) among all tasks and model sizes; 3) with our PANDA approach, prompt-tuning can achieve competitive and even better performance than model-tuning across various PLM scales. We have publicly released our code at https://github.com/WHU-ZQH/PANDA.
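To make the distillation idea concrete, here is a minimal sketch of a student objective that adds a distillation term, scaled by a transferability score, to the usual supervised loss. Several pieces are assumptions rather than details taken from the paper: the distillation term is the standard KL divergence between softened teacher and student output distributions, negative transferability scores are clamped to zero, and the names `panda_style_loss`, `lam`, and `temperature` are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(student_logits: Sequence[float], label: int) -> float:
    """Supervised loss against the ground-truth class index."""
    return -math.log(softmax(student_logits)[label])


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def panda_style_loss(student_logits: Sequence[float],
                     teacher_logits: Sequence[float],
                     label: int,
                     transferability: float,
                     lam: float = 1.0,
                     temperature: float = 1.0) -> float:
    """Supervised loss plus a transferability-weighted distillation term.

    Assumption: the weight is lam * max(transferability, 0), so a source
    task judged dissimilar contributes little or no teacher signal.
    """
    weight = lam * max(transferability, 0.0)
    supervised = cross_entropy(student_logits, label)
    distill = kl_divergence(softmax(teacher_logits, temperature),
                            softmax(student_logits, temperature))
    return supervised + weight * distill
```

With this weighting, a highly transferable source prompt pulls the student strongly toward the teacher, while a poorly matched one leaves the student to learn mostly from the target labels.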

Paper Structure (40 sections, 7 equations, 9 figures, 16 tables)

Introduction
Related Works
Pretrained Language Model
Prompt-tuning for PLMs
Knowledge Distillation
Method
Preliminaries
Prompt transfer
Proposed Methods
Prompt Transferability Metric
PanDa Approach
Expanding to multi-task settings
Experiments
Tasks and Datasets
Implementation Details
...and 25 more sections

Figures (9)

Figure 1: Average performance on parts of the SuperGLUE and GLUE benchmarks. Note that the best performances of SPoT and our PanDa are reported. Our PanDa approach outperforms the vanilla prompt transfer approach, SPoT (Vu et al., 2021), across all model sizes. Additionally, with our PanDa approach, prompt-tuning can obtain competitive or better performance than model-tuning in the full-shot scenario.
Figure 2: Left: Comparisons between model-tuning and prompt-tuning across various model sizes. Here, we report the average performance among all 9 target tasks (as stated in the dataset details table). Right: Sensitivity analysis of prompt-tuning on the initialization of the prompt, where the normal/sparse/constant (all zeros) initialization is used. BERT-large is used in this setting.
Figure 3: Left: An illustration of vanilla PoT. Right: The architecture of our proposed PANDA. Notably, we can first train the teacher network on the target task with fewer iterations and obtain the new teacher network (target-like network), but we do not show the procedure for ease of illustration.
Figure 4: Left: results of different $\lambda$ on BERT-small. Middle: results of different $\lambda$ on BERT-medium. Right: results of different $\lambda$ on BERT-tiny.
Figure 5: From left to right: 1) a heatmap of our predicted prompt transferability across all 21 tasks; 2) results predicted by the method "ON" in prior work (Su et al., 2021); 3) results predicted by the method "E$_{avg}$" in SPoT (Vu et al., 2021); 4) Spearman's correlation scores of these metrics with cross-task (21 source tasks $\rightarrow$ 9 target tasks) prompt transfer performance across all model sizes. Note that BERT-large is used in the left three sub-figures, and the results of more models are shown in the Appendix.
...and 4 more figures
