Reprogramming Distillation for Medical Foundation Models

Yuhang Zhou, Siyuan Du, Haolin Li, Jiangchao Yao, Ya Zhang, Yanfeng Wang

TL;DR

This work tackles adapting medical foundation models to downstream tasks under modality and deployment constraints. It introduces Reprogramming Distillation (RD), which fixes the backbone and trains a reprogramming module and a shared classifier, complemented by co-training to align teacher and student decision boundaries and Centered Kernel Alignment distillation to stabilize transfer. RD consistently outperforms PEFT and KD baselines across five medical datasets and three foundation models, especially in data-scarce settings, while yielding a lightweight and customizable deployment. The approach reduces training overhead, preserves backbone privacy, and demonstrates strong generalization and efficiency for real-world medical applications.

Abstract

Medical foundation models pre-trained on large-scale datasets have demonstrated powerful and versatile capabilities across various tasks. However, because of the gap between pre-training tasks (or modalities) and downstream tasks (or modalities), as well as real-world computation and speed constraints, applying medical foundation models in downstream scenarios may not be straightforward. Previous methods, such as parameter-efficient fine-tuning (PEFT) and knowledge distillation (KD), cannot simultaneously address the task (or modality) inconsistency and achieve personalized lightweight deployment under diverse real-world demands. To address these issues, we propose a novel framework called Reprogramming Distillation (RD). On one hand, RD reprograms the original feature space of the foundation model so that it is more relevant to downstream scenarios, aligning tasks and modalities. On the other hand, through a co-training mechanism and a shared classifier, RD establishes connections between the reprogrammed knowledge and the knowledge of student models, ensuring that the reprogrammed feature space can be smoothly mimicked by student models of different structures. Further, to reduce the randomness under different training conditions, we design a Centered Kernel Alignment (CKA) distillation to promote robust knowledge transfer. Empirically, we show that on extensive datasets, RD consistently achieves superior performance compared with previous PEFT and KD methods.
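To make the CKA distillation idea concrete, the sketch below computes the standard linear form of Centered Kernel Alignment between a batch of teacher features and a batch of student features, and turns it into a loss by taking one minus the similarity. This is a minimal illustration under assumptions, not the paper's exact objective: the choice of linear CKA, the `1 - CKA` loss form, and the names `linear_cka` and `cka_distillation_loss` are all hypothetical.

```python
import numpy as np


def linear_cka(x, y, eps=1e-12):
    """Linear CKA between feature matrices x (n, d1) and y (n, d2).

    Rows are samples from the same batch; the feature widths may differ.
    """
    # Center every feature dimension over the batch.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # CKA = ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return cross / (norm_x * norm_y + eps)


def cka_distillation_loss(teacher_feats, student_feats):
    """Assumed loss form: small when the two feature spaces are well aligned."""
    return 1.0 - linear_cka(teacher_feats, student_feats)
```

Linear CKA is invariant to orthogonal transformations and isotropic scaling of either feature space, and it does not require the teacher and student to share a feature dimension, which is one plausible reason a CKA-based objective suits students of different structures.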

Paper Structure (14 sections, 3 equations, 2 figures, 6 tables)

This paper contains 14 sections, 3 equations, 2 figures, 6 tables.

Figures (2)

  • Figure 1: Overview of RD. During training, the foundation model is the only component kept fixed.
  • Figure 2: Comparison of the decision boundaries of different methods. The decision boundaries of our method are more similar to those of the foundation model than the boundaries of other methods, which may explain its better performance.
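
The setup described for Figure 1 can be sketched as a forward pass in which a frozen backbone feeds a trainable reprogramming module, and one shared classifier scores both the reprogrammed teacher features and the student features. This is an illustrative reading of the description above, not the authors' implementation: single linear maps stand in for the real networks, and every name and size here (`init_params`, `frozen_backbone`, `forward`, the dimensions) is hypothetical.

```python
import numpy as np


def init_params(rng, in_dim=20, backbone_dim=64, feat_dim=16, n_classes=3):
    """Return (frozen backbone weights, dict of trainable parameters)."""
    # The backbone weights are kept outside the trainable dictionary.
    w_backbone = rng.normal(size=(in_dim, backbone_dim))
    trainable = {
        "reprogram": rng.normal(size=(backbone_dim, feat_dim)) * 0.1,
        "student": rng.normal(size=(in_dim, feat_dim)) * 0.1,
        "classifier": rng.normal(size=(feat_dim, n_classes)) * 0.1,
    }
    return w_backbone, trainable


def frozen_backbone(x, w_backbone):
    # Stand-in for the foundation model; its weights are never updated.
    return np.tanh(x @ w_backbone)


def forward(x, w_backbone, trainable):
    # Teacher branch: frozen features mapped into the reprogrammed space.
    teacher_feats = frozen_backbone(x, w_backbone) @ trainable["reprogram"]
    # Student branch: a lightweight model producing features of the same width.
    student_feats = np.maximum(x @ trainable["student"], 0.0)
    # The same classifier weights score both branches.
    teacher_logits = teacher_feats @ trainable["classifier"]
    student_logits = student_feats @ trainable["classifier"]
    return teacher_logits, student_logits
```

Because both branches pass through the same classifier weights, the student features have to live in a space the classifier already separates for the teacher, which is one way to read the role of the shared classifier in co-training.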