MeLo: Low-rank Adaptation is Better than Fine-tuning for Medical Image Diagnosis

Yitao Zhu, Zhenrong Shen, Zihao Zhao, Sheng Wang, Xin Wang, Xiangyu Zhao, Dinggang Shen, Qian Wang

TL;DR

The paper tackles the challenge of deploying large Vision Transformer (ViT) models for medical image diagnosis under constraints on data, storage, and deployment latency. It introduces MeLo, a low-rank adaptation approach that freezes the ViT weights and injects small low-rank updates $BA$ into the self-attention projections, achieving similar or better performance than full fine-tuning while training only about $0.17\%$ of the parameters. Across four diverse medical-imaging datasets and multiple ViT scales, the trainable footprint stays tiny (e.g., $\approx 0.14$M trainable parameters, rising to $\approx 1.22$M for ViT-Giga), which enables rapid task switching with reduced memory and latency. The approach supports multi-task CAD through lightweight plug-in modules, which could make large medical foundation models more accessible and more practical to deploy.

Abstract

The common practice in developing computer-aided diagnosis (CAD) models based on transformer architectures usually involves fine-tuning from ImageNet pre-trained weights. However, with recent advances in large-scale pre-training and the practice of scaling laws, Vision Transformers (ViT) have become much larger and less accessible to medical imaging communities. Additionally, in real-world scenarios, deploying multiple CAD models can be troublesome due to problems such as limited storage space and time-consuming model switching. To address these challenges, we propose a new method, MeLo (Medical image Low-rank adaptation), which enables the development of a single CAD model for multiple clinical tasks in a lightweight manner. It adopts low-rank adaptation instead of resource-demanding fine-tuning. By fixing the weights of ViT models and only adding small low-rank plug-ins, we achieve competitive results on various diagnosis tasks across different imaging modalities using only a few trainable parameters. Specifically, our proposed method achieves comparable performance to fully fine-tuned ViT models on four distinct medical imaging datasets while training only about 0.17% of the parameters. Moreover, MeLo adds only about 0.5MB of storage space and allows for extremely fast model switching in deployment and inference. Our source code and pre-trained weights are available on our website (https://absterzhu.github.io/melo.github.io/).
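
As a rough sanity check on the 0.17% and 0.5MB figures, the arithmetic below counts the parameters that low-rank plug-ins would add to a ViT. It is a sketch under stated assumptions, not the paper's own accounting: the backbone is assumed to be a standard ViT-Base (12 layers, width 768, about 86M parameters), the rank is assumed to be 4, two projections per layer are adapted (the query and value projections named in the Figure 2 caption), and weights are assumed to be stored as 32-bit floats. The function name `lora_trainable_params` is hypothetical.

```python
def lora_trainable_params(depth, width, rank, adapted_per_layer=2):
    """Count parameters added by low-rank plug-ins.

    Each adapted width x width projection gets two small matrices,
    B (width x rank) and A (rank x width), i.e. 2 * width * rank parameters.
    """
    return depth * adapted_per_layer * 2 * width * rank


# Assumptions (not stated in the abstract): ViT-Base backbone and rank 4.
VIT_BASE_DEPTH = 12
VIT_BASE_WIDTH = 768
VIT_BASE_PARAMS = 86_000_000  # approximate size of a standard ViT-Base
ASSUMED_RANK = 4

n = lora_trainable_params(VIT_BASE_DEPTH, VIT_BASE_WIDTH, ASSUMED_RANK)
fraction = n / VIT_BASE_PARAMS
size_mib = n * 4 / 2**20  # 4 bytes per float32 parameter

print(f"trainable parameters: {n}")
print(f"share of backbone:    {fraction:.2%}")
print(f"storage (float32):    {size_mib:.2f} MiB")
```

Under these assumptions the count lands close to the reported figures, which suggests the quoted ratio refers to a ViT-Base-sized backbone; for larger backbones the same plug-ins would be an even smaller share of the model.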

Paper Structure (10 sections, 1 equation, 3 figures, 2 tables)

This paper contains 10 sections, 1 equation, 3 figures, 2 tables.

Figures (3)

  • Figure 1: The motivation of MeLo. The large-scale vision foundation model is like a watermelon, and our proposed MeLo can conveniently adapt it to different clinical tasks with only a few additional parameters.
  • Figure 2: Illustration of our proposed MeLo. For a specific medical image diagnosis task, we inject low-rank decomposition matrices (denoted as $A$ and $B$) into the pre-trained query and value projection matrices (denoted as $W_Q$ and $W_V$) of each self-attention layer. Different module colors correspond to different clinical tasks.
  • Figure 3: The AUC gradually increases as the ViT model size grows, while the number of trainable parameters in the corresponding MeLo modules remains consistently low.
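
The mechanism described in the Figure 2 caption can be sketched in a few lines of dependency-free Python. This is a minimal illustration, not the authors' implementation: a real model would use a tensor library and apply the update to both $W_Q$ and $W_V$ in every layer. The class name `LowRankProjection`, the task names, and the tiny dimensions are hypothetical. Initializing $B$ to zeros and $A$ randomly follows the usual low-rank adaptation convention and is an assumption here, as is the random $B$ used as a stand-in for training.

```python
import random


def matmul(X, Y):
    """Matrix product of X (n x k) and Y (k x m), both given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def matadd(X, Y):
    """Element-wise sum of two matrices of equal shape."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Matrix of small Gaussian values."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class LowRankProjection:
    """A frozen projection W plus swappable per-task low-rank updates B @ A."""

    def __init__(self, W):
        self.W = W          # frozen pre-trained weight (d_out x d_in), never modified
        self.adapters = {}  # task name -> (B, A)

    def add_task(self, task, rank, rng):
        d_out, d_in = len(self.W), len(self.W[0])
        B = [[0.0] * rank for _ in range(d_out)]  # zeros, so the update starts at zero
        A = rand_matrix(rank, d_in, rng)
        self.adapters[task] = (B, A)

    def forward(self, x, task):
        """Compute W x + B (A x) for a column vector x of shape d_in x 1."""
        B, A = self.adapters[task]
        return matadd(matmul(self.W, x), matmul(B, matmul(A, x)))

    def merged_weight(self, task):
        """Fold the task's update into a single matrix W + B A."""
        B, A = self.adapters[task]
        return matadd(self.W, matmul(B, A))


rng = random.Random(0)
d, r = 6, 2  # toy sizes; a real ViT projection is far larger
layer = LowRankProjection(rand_matrix(d, d, rng))
layer.add_task("task_a", r, rng)
layer.add_task("task_b", r, rng)

x = rand_matrix(d, 1, rng)
base = matmul(layer.W, x)

# Stand-in for training task_a: only B (and A) would change, W stays frozen.
_, A_a = layer.adapters["task_a"]
layer.adapters["task_a"] = (rand_matrix(d, r, rng), A_a)

out_a = layer.forward(x, "task_a")                     # adapted output
out_b = layer.forward(x, "task_b")                     # untrained task: same as base
out_merged = matmul(layer.merged_weight("task_a"), x)  # same as out_a up to rounding
```

Switching tasks only changes which pair $(B, A)$ is looked up, while the large matrix $W$ is shared, which is the property behind the fast model switching the paper reports. The merged form $W + BA$ shows that, once a task is fixed, the update can be folded into a single matrix of the original shape.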