A Large-scale Medical Visual Task Adaptation Benchmark

Shentong Mo, Xufang Luo, Yansen Wang, Dongsheng Li

TL;DR

Med-VTAB introduces a large-scale benchmark for medical visual task adaptation, addressing the lack of systematic evaluation across diverse medical imaging modalities. The authors propose GMoE-Adapter, a gated mixture-of-experts approach that fuses general-domain and medical-domain pre-trained weights to enhance adaptation of Vision Transformers. They study scaling laws of medical prompt tuning, generalizability across pre-training sources, and robustness to patient ID distribution shifts, reporting state-of-the-art results on color and multi-modal medical tasks. This benchmark and method collectively push toward scalable, robust, and clinically relevant medical image analysis with potential to improve diagnostic accuracy and cross-site transferability.

Abstract

Visual task adaptation has been demonstrated to be effective in adapting pre-trained Vision Transformers (ViTs) to general downstream visual tasks using specialized learnable layers or tokens. However, there is not yet a large-scale benchmark that fully explores the effect of visual task adaptation on the realistic and important medical domain, particularly across diverse medical visual modalities, such as color images, X-ray, and CT. To close this gap, we present Med-VTAB, a large-scale Medical Visual Task Adaptation Benchmark consisting of 1.68 million medical images for diverse organs, modalities, and adaptation approaches. Based on Med-VTAB, we explore the scaling law of medical prompt tuning with respect to tunable parameters and the generalizability of medical visual adaptation using non-medical and medical pre-trained weights. We also study the impact of patient ID out-of-distribution shift on medical visual adaptation, which is a real and challenging scenario. Furthermore, results from Med-VTAB indicate that a single pre-trained model falls short in medical task adaptation. Therefore, we introduce GMoE-Adapter, a novel method that combines medical and general pre-training weights through a gated mixture-of-experts adapter, achieving state-of-the-art results in medical visual task adaptation.
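
The abstract describes GMoE-Adapter only at a high level: a gate that combines features from medical and general pre-trained weights. As a rough illustration of the general idea of gated mixing, and not the authors' implementation, the sketch below blends two feature vectors with softmax gate weights computed from their concatenation. All names (`gated_mixture`, `gate_w`, `gate_b`) and the choice of a single linear gate are assumptions made for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, bias, x):
    """Apply y = W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def gated_mixture(general_feat, medical_feat, gate_w, gate_b):
    """Blend two expert feature vectors with input-dependent gate weights.

    Hypothetical sketch: the gate is one linear layer over the concatenated
    features, followed by a softmax over the two experts.
    """
    gate_in = general_feat + medical_feat           # list concatenation
    g = softmax(linear(gate_w, gate_b, gate_in))    # g[0] + g[1] == 1
    mixed = [g[0] * a + g[1] * b
             for a, b in zip(general_feat, medical_feat)]
    return mixed, g


# Toy usage: a gate with all-zero parameters weights both experts equally.
general = [1.0, 3.0]   # stand-in for features from general-domain weights
medical = [3.0, 5.0]   # stand-in for features from medical-domain weights
zero_w = [[0.0] * 4, [0.0] * 4]
zero_b = [0.0, 0.0]
mixed, gate = gated_mixture(general, medical, zero_w, zero_b)
```

In a trained model the gate parameters would be learned, so the mixing weights could differ per input; here they are fixed by hand only to keep the example checkable.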

Paper Structure (35 sections, 3 figures, 10 tables)

This paper contains 35 sections, 3 figures, 10 tables.

Figures (3)

  • Figure 1: Med-VTAB is a large-scale benchmark for adaptation on medical images, consisting of 1.68 million samples, 10 rich organs, and 5 challenging modalities in real-world medical scenarios. Med-VTAB presents new challenges for adaptation approaches, including full fine-tuning, head-oriented (e.g., linear probing, partial), backbone-oriented (e.g., adapter), and prompt-oriented (e.g., VPT) methods, using pre-trained models from general and medical domains.
  • Figure 2: Statistics of organ diversity (Left, 10 rich organs) and modality diversity (Right, 5 challenging modalities) in the proposed Medical Visual Task Adaptation Benchmark (Med-VTAB) consisting of 1.68 million images.
  • Figure 3: Illustration of the proposed Gated Mixture-of-Experts (GMoE)-Adapter framework versus the standard adapter and MoE-adapter methods.