PANDA: Prompt Transfer Meets Knowledge Distillation for Efficient Model Adaptation
Qihuang Zhong, Liang Ding, Juhua Liu, Bo Du, Dacheng Tao
TL;DR
This work addresses the limitations of vanilla prompt transfer (PoT) for prompt-tuning by introducing PANDA, a PoT framework enhanced with knowledge distillation (KD) and guided by a novel prompt transferability metric. The metric uses soft prompts to map source and target tasks into a shared semantic space and computes transferability as the cosine similarity of the task embeddings, enabling adaptive, distance-aware knowledge transfer. PANDA uses a target-like teacher, obtained by applying the source prompt to the target task, and a student with a randomly initialized prompt. The student is trained with both ground-truth supervision and distillation, with the distillation term weighted by the predicted transferability. The method also extends to multi-task transfer with early- and late-fusion strategies. Across 189 source-target pairs and 5 scales of pretrained language models (PLMs), PANDA consistently outperforms vanilla PoT by 2.3% on average and by up to 24.1% on some tasks, and with PANDA prompt-tuning achieves results competitive with or better than full model-tuning in several regimes. The approach applies to a range of PLMs and multi-task settings, and offers a principled direction for more robust and efficient transfer in NLP models.
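To make the metric concrete, the following minimal sketch scores source tasks against a target task by the cosine similarity of their task embeddings. It assumes each task has already been reduced to a fixed-length vector; how PANDA derives those embeddings from soft prompts is not detailed here, and the names `cosine_similarity` and `rank_sources` are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_u * norm_v)


def rank_sources(
    target_embedding: Sequence[float],
    source_embeddings: Dict[str, Sequence[float]],
) -> List[Tuple[str, float]]:
    """Rank source tasks by similarity to the target, most similar first.

    Assumption: the embeddings are precomputed task vectors; this sketch
    does not reproduce how they are obtained from soft prompts.
    """
    scored = [
        (name, cosine_similarity(target_embedding, emb))
        for name, emb in source_embeddings.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

A higher score marks a source task as a more promising origin for transfer, and the same score can serve as the weight on the distillation term.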
Abstract
Prompt Transfer (PoT) is a recently proposed approach to improving prompt-tuning by initializing the target prompt with an existing prompt trained on similar source tasks. However, such a vanilla PoT approach usually achieves sub-optimal performance, as (i) PoT is sensitive to the similarity of the source-target pair and (ii) directly fine-tuning a prompt initialized with the source prompt on the target task might lead to forgetting of the useful general knowledge learned from the source task. To tackle these issues, we propose a new metric to accurately predict prompt transferability (regarding (i)), and a novel PoT approach (namely PANDA) that leverages knowledge distillation to effectively alleviate knowledge forgetting (regarding (ii)). Extensive and systematic experiments on 189 combinations of 21 source and 9 target datasets across 5 scales of PLMs demonstrate that: 1) our proposed metric predicts prompt transferability well; 2) PANDA consistently outperforms the vanilla PoT approach by 2.3% in average score (up to 24.1%) across all tasks and model sizes; 3) with PANDA, prompt-tuning can achieve competitive and even better performance than model-tuning across PLM scales. We have publicly released our code at https://github.com/WHU-ZQH/PANDA.
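The training objective described above, ground-truth supervision plus a distillation term scaled by the predicted transferability, can be sketched as follows. This is an illustration under stated assumptions rather than the released implementation: the distillation term is taken to be a KL divergence between teacher and student output distributions, negative transferability scores are clipped to zero, and the function names and the `temperature` parameter are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits: Sequence[float], label: int) -> float:
    """Supervised loss against the ground-truth class index."""
    return -math.log(softmax(logits)[label])


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def distillation_weighted_loss(
    student_logits: Sequence[float],
    teacher_logits: Sequence[float],
    label: int,
    transferability: float,
    temperature: float = 1.0,
) -> float:
    """Cross-entropy plus a transferability-weighted distillation term.

    Assumptions: KL divergence stands in for the distillation loss, and
    negative transferability scores are clipped to zero.
    """
    weight = max(0.0, transferability)
    supervised = cross_entropy(student_logits, label)
    distill = kl_divergence(
        softmax(teacher_logits, temperature),
        softmax(student_logits, temperature),
    )
    return supervised + weight * distill
```

With a transferability near zero the student learns almost entirely from the labels, while a similar source task pulls the student more strongly toward the teacher.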
