Reprogramming Distillation for Medical Foundation Models
Yuhang Zhou, Siyuan Du, Haolin Li, Jiangchao Yao, Ya Zhang, Yanfeng Wang
TL;DR
This work tackles adapting medical foundation models to downstream tasks under modality and deployment constraints. It introduces Reprogramming Distillation (RD), which fixes the backbone and trains a reprogramming module and a shared classifier, complemented by co-training to align teacher and student decision boundaries and Centered Kernel Alignment distillation to stabilize transfer. RD consistently outperforms PEFT and KD baselines across five medical datasets and three foundation models, especially in data-scarce settings, while yielding a lightweight and customizable deployment. The approach reduces training overhead, preserves backbone privacy, and demonstrates strong generalization and efficiency for real-world medical applications.
Abstract
Medical foundation models pre-trained on large-scale datasets have demonstrated powerful and versatile capabilities across various tasks. However, because of the gap between pre-training and downstream tasks (or modalities), together with real-world computation and speed constraints, applying medical foundation models in downstream scenarios may not be straightforward. Previous methods, such as parameter-efficient fine-tuning (PEFT) and knowledge distillation (KD), cannot simultaneously address the task (or modality) inconsistency and achieve personalized lightweight deployment under diverse real-world demands. To address these issues, we propose a novel framework called Reprogramming Distillation (RD). On the one hand, RD reprograms the original feature space of the foundation model so that it is more relevant to downstream scenarios, aligning tasks and modalities. On the other hand, a co-training mechanism and a shared classifier establish connections between the reprogrammed knowledge and the knowledge of student models, ensuring that the reprogrammed feature space can be smoothly mimicked by student models of different structures. Further, to reduce the randomness under different training conditions, we design a Centered Kernel Alignment (CKA) distillation to promote robust knowledge transfer. Empirically, we show that on extensive datasets, RD consistently achieves superior performance compared with previous PEFT and KD methods.
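To make the CKA distillation idea concrete, the following is a minimal sketch of a CKA-based feature-matching loss. The abstract does not say which CKA variant or loss form RD uses, so this sketch assumes linear CKA and a loss of 1 - CKA; the function names `linear_cka` and `cka_distillation_loss` and the feature shapes are hypothetical and chosen only for illustration.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between feature matrices x (n, d1) and y (n, d2).

    Rows are samples from the same batch; the two feature widths may differ.
    Assumes neither matrix is constant across the batch (nonzero norms).
    """
    # Center each feature dimension over the batch.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return cross / (norm_x * norm_y)


def cka_distillation_loss(teacher_feats, student_feats):
    """Assumed loss form: 1 - CKA, which is 0 when the features are fully aligned."""
    return 1.0 - linear_cka(teacher_feats, student_feats)


# Hypothetical batch: 32 samples, 64-d teacher features, 16-d student features.
rng = np.random.default_rng(0)
teacher = rng.normal(size=(32, 64))
student = rng.normal(size=(32, 16))
loss = cka_distillation_loss(teacher, student)
```

Linear CKA is invariant to orthogonal transformations and isotropic scaling of either feature matrix and does not require equal feature widths, which is one plausible reason a CKA-style objective suits a teacher and student of different structures.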
