MeLo: Low-rank Adaptation is Better than Fine-tuning for Medical Image Diagnosis

Yitao Zhu; Zhenrong Shen; Zihao Zhao; Sheng Wang; Xin Wang; Xiangyu Zhao; Dinggang Shen; Qian Wang

TL;DR

The paper tackles the challenge of deploying large Vision Transformer (ViT) models for medical image diagnosis under constraints on data, storage, and deployment latency. It introduces MeLo, a low-rank adaptation approach that freezes the ViT weights and injects small low-rank adapters $BA$ into the query and value projections of each self-attention layer, achieving performance comparable to or better than full fine-tuning while training only about $0.17\%$ of the parameters. Across four diverse medical-imaging datasets and multiple ViT scales, MeLo keeps a tiny footprint (e.g., $\approx 0.14$M trainable parameters, scaling to $\approx 1.22$M for ViT-Giga) and enables rapid task switching with reduced memory and latency. This supports multi-task CAD with lightweight plug-in modules and may make large medical foundation models more practical to deploy.

Abstract

The common practice in developing computer-aided diagnosis (CAD) models based on transformer architectures is to fine-tune from ImageNet pre-trained weights. However, with recent advances in large-scale pre-training and the application of scaling laws, Vision Transformers (ViT) have become much larger and less accessible to the medical imaging community. Additionally, in real-world scenarios, deploying multiple CAD models can be troublesome because of limited storage space and time-consuming model switching. To address these challenges, we propose a new method, MeLo (Medical image Low-rank adaptation), which enables the development of a single CAD model for multiple clinical tasks in a lightweight manner. It adopts low-rank adaptation instead of resource-demanding fine-tuning. By freezing the weights of ViT models and adding only small low-rank plug-ins, we achieve competitive results on various diagnosis tasks across different imaging modalities using only a few trainable parameters. Specifically, our proposed method achieves performance comparable to fully fine-tuned ViT models on four distinct medical imaging datasets while training only about 0.17% of the parameters. Moreover, MeLo adds only about 0.5MB of storage space and allows for extremely fast model switching in deployment and inference. Our source code and pre-trained weights are available on our website (https://absterzhu.github.io/melo.github.io/).
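
The 0.17% and 0.5MB figures quoted above can be sanity-checked with a back-of-envelope count. The sketch below assumes a ViT-Base backbone (12 layers, hidden width 768, roughly 86M parameters), an adapter rank of 4, and float32 storage; none of these values is stated in this summary, so they are illustrative assumptions, and the function name `melo_param_count` is hypothetical.

```python
def melo_param_count(num_layers, hidden_dim, rank, adapted_per_layer=2):
    """Count trainable parameters when a pair of low-rank matrices
    A (rank x d) and B (d x rank) is added to `adapted_per_layer`
    projections (here W_Q and W_V) in every self-attention layer."""
    per_projection = rank * hidden_dim + hidden_dim * rank
    return num_layers * adapted_per_layer * per_projection


# Assumed values (not given in the summary): ViT-Base, rank 4, float32.
params = melo_param_count(num_layers=12, hidden_dim=768, rank=4)
backbone_params = 86_000_000          # approximate ViT-Base size
fraction = params / backbone_params   # share of parameters that are trained
size_mib = params * 4 / 2**20         # 4 bytes per float32 value

print(f"trainable parameters: {params:,}")
print(f"fraction of backbone: {fraction:.2%}")
print(f"float32 storage:      {size_mib:.2f} MiB")
```

Under these assumptions the count comes to 147,456 parameters, about 0.17% of the backbone and roughly 0.56 MiB in float32, which is in line with the figures reported in the abstract.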

Paper Structure (10 sections, 1 equation, 3 figures, 2 tables)

Introduction
Method
Medical Image Low-rank Adaptation (MeLo)
Datasets and Implementation Details
Experiments
Performance on Different Diagnosis Tasks
Performance on Different ViT Models
Performance on Deployment and Inference
Conclusion and Discussion
Compliance with Ethical Standards

Figures (3)

Figure 1: The motivation of MeLo. The large-scale vision foundation model is like a watermelon, and our proposed MeLo can conveniently adapt it to different clinical tasks with only a few additional parameters.
Figure 2: Illustration of our proposed MeLo. For a specific medical image diagnosis task, we inject low-rank decomposition matrices (denoted as $A$ and $B$) into the pre-trained query and value projection matrices (denoted as $W_Q$ and $W_V$) of each self-attention layer. Different module colors correspond to different clinical tasks.
Figure 3: The AUC gradually increases as the ViT model size expands while the trainable parameters of corresponding MeLo modules remain consistently low.
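
The mechanism described in the Figure 2 caption can be sketched in a few lines of NumPy: a frozen projection $W$ is combined with a trainable low-rank update $BA$. This is a minimal illustration, not the authors' implementation. The class name `LowRankAdaptedLinear`, the zero initialization of $B$, the $\alpha / r$ scaling, and the toy dimensions are assumptions borrowed from common low-rank adaptation practice and are not taken from this summary.

```python
import numpy as np


class LowRankAdaptedLinear:
    """Frozen projection W plus a trainable low-rank update B @ A (hypothetical sketch)."""

    def __init__(self, weight, rank, alpha=None, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = weight.shape
        self.weight = weight                                 # frozen pre-trained W_Q or W_V
        self.A = rng.normal(0.0, 0.01, size=(rank, d_in))    # trainable, small random init
        self.B = np.zeros((d_out, rank))                     # trainable, zero init (assumed)
        self.scale = (rank if alpha is None else alpha) / rank

    def __call__(self, x):
        # x has shape (tokens, d_in); the adapter adds scale * x A^T B^T to the frozen output.
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T

    def merged_weight(self):
        # Folding B @ A into W gives a single matrix of the original shape.
        return self.weight + self.scale * self.B @ self.A


# Toy usage with assumed sizes (hidden width 64, rank 4).
rng = np.random.default_rng(1)
d, r = 64, 4
W_q = rng.normal(size=(d, d))
x = rng.normal(size=(5, d))

layer = LowRankAdaptedLinear(W_q, rank=r)
out_init = layer(x)                  # B is zero, so this matches the frozen model
layer.B = rng.normal(size=(d, r))    # stand-in for a B learned on one clinical task
out_adapted = layer(x)
out_merged = x @ layer.merged_weight().T
```

With $B$ initialized to zero, the adapted layer starts out identical to the pre-trained one, and switching tasks only requires swapping the small `A` and `B` arrays while `W_q` stays in memory. Merging $BA$ into $W$ reproduces the adapted output with a single matrix multiplication.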
