MDViT: Multi-domain Vision Transformer for Small Medical Image Segmentation Datasets
Siyi Du, Nourhan Bayasi, Ghassan Hamarneh, Rafeef Garbi
TL;DR
MDViT addresses the data hunger of vision transformers (ViTs) on small medical image segmentation (MIS) datasets with a fixed-size multi-domain ViT. Domain adapters mitigate negative knowledge transfer by enabling domain-aware attention within multi-head self-attention (MHSA), and a mutual knowledge distillation (MKD) framework transfers knowledge between a universal network and domain-specific peers to promote robust, shared representation learning across domains. Evaluated on four skin lesion segmentation datasets, MDViT outperforms separate and joint training schemes as well as state-of-the-art data-efficient MIS ViTs, with gains such as a 10.16% improvement in IoU on SCD, and it remains robust as more domains are added. Because the model size is fixed at inference, the approach is practical for deploying MIS models across diverse, smaller datasets, and the domain adapters can also be plugged into other ViTs.
Abstract
Despite its clinical utility, medical image segmentation (MIS) remains a daunting task due to images' inherent complexity and variability. Vision transformers (ViTs) have recently emerged as a promising solution to improve MIS; however, they require larger training datasets than convolutional neural networks. To overcome this obstacle, data-efficient ViTs were proposed, but they are typically trained using a single source of data, which overlooks the valuable knowledge that could be leveraged from other available datasets. Naively combining datasets from different domains can result in negative knowledge transfer (NKT), i.e., a decrease in model performance on some domains with non-negligible inter-domain heterogeneity. In this paper, we propose MDViT, the first multi-domain ViT that includes domain adapters to mitigate data-hunger and combat NKT by adaptively exploiting knowledge in multiple small data resources (domains). Further, to enhance representation learning across domains, we integrate a mutual knowledge distillation paradigm that transfers knowledge between a universal network (spanning all the domains) and auxiliary domain-specific branches. Experiments on four skin lesion segmentation datasets show that MDViT outperforms state-of-the-art algorithms, with superior segmentation performance and a fixed model size at inference time, even as more domains are added. Our code is available at https://github.com/siyi-wind/MDViT.
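
To make the mutual knowledge distillation idea concrete, the following is a minimal sketch of how two networks can each be penalized for disagreeing with the other. It is an illustration under stated assumptions, not the loss used in MDViT: the symmetric KL divergence, the per-pixel foreground-probability representation, and the names `binary_kl` and `mutual_distillation_losses` are all hypothetical choices made for this example.

```python
import math


def binary_kl(target_probs, student_probs, eps=1e-7):
    """Mean per-pixel KL(target || student) for foreground probabilities.

    Assumption: each pixel carries one foreground probability in [0, 1],
    as in binary lesion segmentation.
    """
    total = 0.0
    for t, s in zip(target_probs, student_probs):
        # Clamp to avoid log(0) and division by zero.
        t = min(max(t, eps), 1.0 - eps)
        s = min(max(s, eps), 1.0 - eps)
        total += t * math.log(t / s) + (1.0 - t) * math.log((1.0 - t) / (1.0 - s))
    return total / len(target_probs)


def mutual_distillation_losses(universal_probs, peer_probs):
    """Return (loss for the universal network, loss for the domain-specific peer).

    Each network treats the other's prediction as a fixed target, so
    knowledge flows in both directions (hypothetical formulation).
    """
    loss_universal = binary_kl(peer_probs, universal_probs)
    loss_peer = binary_kl(universal_probs, peer_probs)
    return loss_universal, loss_peer


# Example: predictions for a four-pixel image from the two networks.
universal = [0.9, 0.8, 0.2, 0.1]
peer = [0.7, 0.9, 0.4, 0.1]
loss_u, loss_p = mutual_distillation_losses(universal, peer)
print(loss_u, loss_p)
```

In a real training loop each of these terms would be added to that network's supervised segmentation loss, with the other network's output excluded from gradient computation. Only the universal network would be kept at inference, which is consistent with the fixed model size described above.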
