Task-Specific Knowledge Distillation from the Vision Foundation Model for Enhanced Medical Image Segmentation

Pengchen Liang, Haishan Huang, Bin Pu, Jianguo Chen, Xiang Hua, Jing Zhang, Weibo Ma, Zhuangzhuang Chen, Yiwei Li, Qing Chang

TL;DR

The paper tackles data-scarce medical image segmentation by distilling task-specific knowledge from a Vision Foundation Model. It introduces TS-KD, which first fine-tunes a large SAM model via LoRA on the target segmentation task, then distills task-relevant representations and outputs to a compact ViT-Tiny student, augmented with diffusion-generated synthetic transfer data. The approach yields consistent gains over task-agnostic KD and self-supervised methods across five diverse datasets, with notable improvements in Dice scores and boundary precision under limited labeled data. The work also provides theoretical insights into improved generalization and analyzes practical aspects like transfer-set size, LoRA-rank choices, and training efficiency, highlighting TS-KD as a scalable path for deploying accurate medical segmentation models in resource-constrained environments.

Abstract

Large-scale pre-trained models, such as Vision Foundation Models (VFMs), have demonstrated impressive performance across various downstream tasks by transferring generalized knowledge, especially when target data is limited. However, their high computational cost and the domain gap between natural and medical images limit their practical application in medical segmentation tasks. Motivated by this, we pose the following important question: "How can we effectively utilize the knowledge of large pre-trained VFMs to train a small, task-specific model for medical image segmentation when training data is limited?" To address this problem, we propose a novel and generalizable task-specific knowledge distillation framework. Our method fine-tunes the VFM on the target segmentation task to capture task-specific features before distilling the knowledge to smaller models, leveraging Low-Rank Adaptation (LoRA) to reduce the computational cost of fine-tuning. Additionally, we incorporate synthetic data generated by diffusion models to augment the transfer set, enhancing model performance in data-limited scenarios. Experimental results across five medical image datasets demonstrate that our method consistently outperforms task-agnostic knowledge distillation and self-supervised pretraining approaches like MoCo v3 and Masked Autoencoders (MAE). For example, on the KidneyUS dataset, our method achieved a 28% higher Dice score than task-agnostic KD using 80 labeled samples for fine-tuning. On the CHAOS dataset, it achieved an 11% improvement over MAE with 100 labeled samples. These results underscore the potential of task-specific knowledge distillation to train accurate, efficient models for medical image segmentation in data-constrained settings.
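The distillation step described above (matching the small model to a fine-tuned VFM on the transfer set) can be illustrated with a minimal sketch. The exact loss terms and weights are not given in this excerpt, so the following assumes a mean-squared-error term on hidden features plus a binary cross-entropy term between student and teacher mask predictions. The function names, the `feat_weight` and `pred_weight` parameters, and the assumption that student features have already been projected to the teacher's shape are all hypothetical.

```python
import numpy as np


def feature_loss(student_feat, teacher_feat):
    """Mean squared error between student and teacher hidden features.

    Assumes both arrays already share the same shape (for example after a
    learned projection of the student features; that projection is not shown).
    """
    return float(np.mean((student_feat - teacher_feat) ** 2))


def prediction_loss(student_logits, teacher_logits):
    """Binary cross-entropy of student logits against teacher soft masks."""
    teacher_prob = 1.0 / (1.0 + np.exp(-teacher_logits))
    z = student_logits
    # Numerically stable BCE-with-logits: max(z, 0) - z * t + log(1 + exp(-|z|))
    bce = np.maximum(z, 0.0) - z * teacher_prob + np.log1p(np.exp(-np.abs(z)))
    return float(np.mean(bce))


def distillation_loss(student_feat, teacher_feat, student_logits, teacher_logits,
                      feat_weight=1.0, pred_weight=1.0):
    """Weighted sum of the two terms; the weights are illustrative assumptions."""
    return (feat_weight * feature_loss(student_feat, teacher_feat)
            + pred_weight * prediction_loss(student_logits, teacher_logits))
```

No labels appear in this objective: only teacher outputs are needed, which is why an unlabeled (and partly synthetic) transfer set can be used for this stage.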

Paper Structure

This paper contains 45 sections, 13 equations, 10 figures, 16 tables.

Figures (10)

  • Figure 1: Overview of the proposed task-specific knowledge distillation framework compared with the task-agnostic method. Top: Transfer set curation using a diffusion model to generate synthetic data from a small labeled dataset, expanding the training set. Middle: Task-agnostic knowledge distillation method: (a) The small Vision Transformer (ViT) model is pre-trained by matching its features to those extracted by the VFM on the unlabeled transfer set, focusing on general feature alignment. (b) The small ViT is fine-tuned using the labeled target task data. Bottom: Proposed task-specific knowledge distillation method: (c) The VFM is first fine-tuned on the target task using LoRA adaptation. (d) The small ViT model is pre-trained by aligning hidden layer representations and segmentation-specific predictions between the VFM and the small ViT on the unlabeled transfer set. (e) The small ViT model is fine-tuned using the labeled target task data to optimize segmentation performance.
  • Figure 2: Efficient Task-Specific Knowledge Distillation framework. (a) Fine-tuning large models for specific medical segmentation tasks, such as kidney ultrasound, melanoma, and retinal vessel segmentation, using the Vision Foundation Model (the Segment Anything Model, SAM). The framework uses LoRA fine-tuning to adapt the VFM's encoder and decoder to these tasks. Transfer datasets are used in combination with knowledge distillation to train smaller task-specific models for efficient deployment. (b) Detailed architecture of LoRA fine-tuning, showing a VFM transformer block with LoRA layers for efficient parameter updates: the pre-trained weights are kept frozen while low-rank matrices adjust the attention mechanism to fit the target task.
  • Figure 3: Comparison of Dice scores across five medical imaging datasets (KidneyUS, Autooral, CHAOS, PH2, and DRIVE) using different pretraining and knowledge distillation methods. Each plot (a)-(e) represents a dataset, with results shown for models fine-tuned with varying numbers of labeled samples. Approaches include a Small ViT model trained from scratch, ImageNet pre-trained Small ViT model with MAE, self-supervised pretraining with MoCo v3 and MAE, task-agnostic knowledge distillation (KD), and task-specific KD (TS-KD). Task-specific KD consistently demonstrates the highest performance trends and superior Dice scores across all datasets, showcasing its effectiveness in leveraging task-relevant features for medical image segmentation.
  • Figure 4: Qualitative comparison of segmentation results for a small Vision Transformer (ViT) model fine-tuned using different pretraining strategies across five medical image datasets (KidneyUS, Autooral, CHAOS, PH2, and DRIVE). Rows represent various pretraining and knowledge distillation strategies: Random Initialization, MAE-Pretrain (ImageNet Datasets), MAE-Pretrain (Transfer Datasets), MOCOV3-Pretrain (Transfer Datasets), Task-Agnostic KD (Transfer Datasets), and Task-Specific KD (Transfer Datasets). Each column corresponds to a dataset, with segmentation results visualized for representative samples. Task-Specific KD consistently produces segmentations that are more accurate and closely aligned with ground truth, demonstrating its effectiveness in leveraging task-specific features in medical image segmentation.
  • Figure 5: Segmentation performance on the KidneyUS dataset across different pre-training and knowledge distillation methods. (a) Dice score trends for self-supervised pretraining with MoCo v3 using transfer datasets of varying sizes (1000, 2000, and 3000 images). (b) Dice score trends for self-supervised pretraining with MAE using different transfer dataset sizes. (c) Dice score trends for task-agnostic KD (SAM) across different transfer dataset sizes and numbers of labeled samples used for fine-tuning. (d) Dice score trends for task-specific KD (SAM) show superior performance across all labeled sample sizes and transfer dataset scales, demonstrating the most significant performance improvements when larger transfer datasets are used.
  • ...and 5 more figures
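
The LoRA mechanism described in the Figure 2(b) caption (frozen pre-trained weights plus trainable low-rank matrices) can be sketched as follows. This is a generic illustration for a single linear projection, such as an attention query projection, and not the paper's implementation. The class name `LoRALinear`, the `rank` and `alpha` defaults, and the initialization scheme are assumptions chosen for the example.

```python
import numpy as np


class LoRALinear:
    """Linear layer y = x W^T with a low-rank update: W + (alpha / rank) * B A."""

    def __init__(self, weight, rank=4, alpha=4.0, seed=0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = weight.shape
        self.weight = weight  # pre-trained weight, kept frozen (never updated)
        # Only A and B would be trained. B starts at zero, so the adapted
        # layer initially reproduces the pre-trained layer exactly.
        self.A = rng.normal(0.0, 0.01, size=(rank, in_dim))
        self.B = np.zeros((out_dim, rank))
        self.scale = alpha / rank

    def __call__(self, x):
        frozen_out = x @ self.weight.T
        lora_out = (x @ self.A.T) @ self.B.T
        return frozen_out + self.scale * lora_out

    def trainable_params(self):
        """Number of parameters in the low-rank matrices A and B."""
        return self.A.size + self.B.size
```

For a 64 x 64 projection with rank 4, the trainable matrices hold 4 * 64 + 64 * 4 = 512 values, compared with 4096 in the frozen weight, which is the source of the reduced fine-tuning cost.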