MeLo: Low-rank Adaptation is Better than Fine-tuning for Medical Image Diagnosis
Yitao Zhu, Zhenrong Shen, Zihao Zhao, Sheng Wang, Xin Wang, Xiangyu Zhao, Dinggang Shen, Qian Wang
TL;DR
The paper tackles the challenge of deploying large Vision Transformer (ViT) models for medical image diagnosis under constraints on data, storage, and deployment latency. It introduces MeLo, a low-rank adaptation approach that freezes the ViT weights and injects small low-rank $BA$ adapters into the self-attention projections, matching or exceeding full fine-tuning while training only about $0.17\%$ of the parameters. Across four diverse medical-imaging datasets and multiple ViT scales, MeLo keeps a tiny footprint (e.g., $\approx 0.14$M trainable parameters, scaling to $\approx 1.22$M for ViT-Giga) and enables rapid task switching with reduced memory and latency. Because each task needs only a lightweight plug-in module on top of a shared backbone, the approach supports multi-task CAD and makes large pre-trained models more practical to deploy in medical imaging.
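The adapter idea summarized above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the class name `LowRankAdaptedLinear`, the rank of 4, the hidden size of 768, the 197-token input, and the initialization (random $A$, zero $B$, as is common for low-rank adapters) are all assumptions made for the example.

```python
import numpy as np


class LowRankAdaptedLinear:
    """Frozen projection W plus a trainable low-rank update B @ A (illustrative only)."""

    def __init__(self, weight, rank, rng):
        d_out, d_in = weight.shape
        self.weight = weight  # pre-trained weight; kept frozen during adaptation
        # Assumed initialization: A is small and random, B is zero, so the
        # adapted layer initially reproduces the frozen layer exactly.
        self.A = rng.normal(0.0, 0.02, size=(rank, d_in))
        self.B = np.zeros((d_out, rank))

    def __call__(self, x):
        # x: (tokens, d_in). Computes x (W + B A)^T without materializing B @ A.
        return x @ self.weight.T + (x @ self.A.T) @ self.B.T

    def trainable_parameters(self):
        # Only A and B would receive gradient updates.
        return self.A.size + self.B.size


rng = np.random.default_rng(0)
d = 768                                   # assumed hidden size
W_q = rng.normal(size=(d, d))             # stand-in for a frozen query projection
layer = LowRankAdaptedLinear(W_q, rank=4, rng=rng)

x = rng.normal(size=(197, d))             # assumed token sequence
out = layer(x)
```

With $B$ initialized to zero, `out` equals the frozen projection's output, so adaptation starts from the pre-trained behavior. The adapter in this sketch holds `2 * 768 * 4` values, against `768 * 768` in the frozen matrix it sits beside.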
Abstract
The common practice in developing computer-aided diagnosis (CAD) models based on transformer architectures is to fine-tune from ImageNet pre-trained weights. However, with recent advances in large-scale pre-training and the practice of scaling laws, Vision Transformers (ViT) have become much larger and less accessible to the medical imaging community. Additionally, in real-world scenarios, deploying multiple CAD models can be troublesome because of limited storage space and time-consuming model switching. To address these challenges, we propose a new method, MeLo (Medical image Low-rank adaptation), which enables the development of a single CAD model for multiple clinical tasks in a lightweight manner. It adopts low-rank adaptation instead of resource-demanding fine-tuning. By freezing the weights of ViT models and adding only small low-rank plug-ins, we achieve competitive results on various diagnosis tasks across different imaging modalities with only a few trainable parameters. Specifically, our proposed method achieves performance comparable to fully fine-tuned ViT models on four distinct medical imaging datasets while training only about 0.17% of the parameters. Moreover, MeLo adds only about 0.5 MB of storage space and allows for extremely fast model switching in deployment and inference. Our source code and pre-trained weights are available on our website (https://absterzhu.github.io/melo.github.io/).
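To give a feel for where figures of this size can come from, the following back-of-the-envelope calculation counts adapter parameters under one assumed configuration. None of these settings are stated in the abstract: a ViT-Base backbone (12 layers, hidden size 768, roughly 86M parameters), rank-4 adapters on the query and value projections only, and 32-bit storage are all assumptions chosen for illustration.

```python
# Assumed configuration (not taken from the abstract).
num_layers = 12              # ViT-Base depth
hidden_dim = 768             # ViT-Base hidden size
backbone_params = 86_000_000 # approximate ViT-Base parameter count
rank = 4                     # assumed adapter rank
adapted_projections = 2      # assumed: query and value projections per layer
bytes_per_param = 4          # assumed float32 storage

# One adapter is A (rank x hidden_dim) plus B (hidden_dim x rank).
params_per_adapter = rank * hidden_dim + hidden_dim * rank

adapter_params = num_layers * adapted_projections * params_per_adapter
trainable_fraction = adapter_params / backbone_params
adapter_mebibytes = adapter_params * bytes_per_param / 2**20

print(f"adapter parameters: {adapter_params:,}")
print(f"fraction of backbone: {trainable_fraction:.4%}")
print(f"adapter storage: {adapter_mebibytes:.4f} MiB")
```

Under these assumptions the adapters total 147,456 parameters, about 0.17% of the backbone and a little over half a mebibyte on disk, which is the same order as the numbers quoted above. A different rank or choice of projections would scale the count linearly.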
