Efficiency in Focus: LayerNorm as a Catalyst for Fine-tuning Medical Visual Language Pre-trained Models
Jiawei Chen, Dingkang Yang, Yue Jiang, Mingcheng Li, Jinjie Wei, Xiaolu Hou, Lihua Zhang
TL;DR
The paper tackles the challenge of fine-tuning Medical Visual Language Models under limited data and resources by shifting focus from extrinsic PEFT methods to intrinsic parameter fine-tuning, specifically LayerNorm. Through systematic experiments on small-scale (MISS) and large-scale (LLaVA-Med) Med-VLMs across Med-VQA and Med-IRG tasks, it demonstrates that LayerNorm-tuning provides superior efficiency and often superior or comparable accuracy with far fewer trainable parameters than conventional PEFT approaches. The work reveals distinct behaviors across modules and scales: LayerNorm-tuning shows robust generalization, good zero-shot and OOD performance, and maintains clinical relevance, while other methods like FFN-tuning can improve certain metrics at the cost of parameter volume and generalization. Overall, LN-tuning emerges as a practical and scalable strategy for adapting Med-VLMs to specialized medical tasks, enabling broader deployment with limited computational resources.
Abstract
In the realm of Medical Visual Language Models (Med-VLMs), the quest for universal, efficient fine-tuning mechanisms remains paramount yet largely unexplored, especially given that researchers in interdisciplinary fields are often extremely short of training resources. Given the unique challenges of the medical domain, such as limited data scope and significant domain-specific requirements, evaluating and adapting Parameter-Efficient Fine-Tuning (PEFT) methods specifically for Med-VLMs is essential. Current PEFT methods for Med-VLMs have yet to be comprehensively investigated, and most focus on adding components to the model's structure or input. However, fine-tuning intrinsic model components often yields better generality and consistency, and its impact on the ultimate performance of Med-VLMs has been widely overlooked. In this paper, we explore an alternative to traditional PEFT methods, specifically the impact of fine-tuning LayerNorm layers, feed-forward networks (FFNs), and attention layers in Med-VLMs. Our comprehensive studies span both small-scale and large-scale Med-VLMs, evaluating their performance under various fine-tuning paradigms across tasks such as Medical Visual Question Answering and Medical Imaging Report Generation. The findings offer insights into how intrinsic parameter fine-tuning affects the adaptation of Med-VLMs to downstream tasks and show that fine-tuning solely the LayerNorm layers not only surpasses the efficiency of traditional PEFT methods but also retains the model's accuracy and generalization capabilities across a spectrum of medical downstream tasks. The experiments show LayerNorm fine-tuning's superior adaptability and scalability, particularly in the context of large-scale Med-VLMs.
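To give a rough sense of why tuning only LayerNorm involves so few trainable parameters compared with tuning FFN or attention layers, the sketch below counts the parameters in one generic Transformer block. The block layout (four attention projections with biases, a two-layer FFN with a 4x expansion, two LayerNorms) and the hidden size of 768 are illustrative assumptions, not the configurations of MISS or LLaVA-Med, and the helper names are hypothetical.

```python
# Minimal sketch: parameter budget of one generic Transformer block.
# ASSUMPTION: a textbook block layout, not the architecture of any model
# evaluated in the paper.

def block_param_counts(d_model: int, ffn_mult: int = 4) -> dict:
    """Return per-module parameter counts (weights plus biases)."""
    d_ff = ffn_mult * d_model
    # Q, K, V, and output projections: four d_model x d_model matrices with biases.
    attention = 4 * (d_model * d_model + d_model)
    # Two linear layers: d_model -> d_ff and d_ff -> d_model, each with a bias.
    ffn = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    # Two LayerNorms, each with a gain vector and a bias vector of size d_model.
    layernorm = 2 * (2 * d_model)
    return {"attention": attention, "ffn": ffn, "layernorm": layernorm}


def trainable_fraction(counts: dict, tuned: list) -> float:
    """Fraction of block parameters updated when only `tuned` modules are trained."""
    total = sum(counts.values())
    return sum(counts[name] for name in tuned) / total


if __name__ == "__main__":
    counts = block_param_counts(d_model=768)  # hidden size chosen for illustration
    for module in ("layernorm", "attention", "ffn"):
        share = trainable_fraction(counts, [module])
        print(f"{module:>9}: {counts[module]:>9,d} params ({share:.4%} of block)")
```

Under these assumptions, LayerNorm parameters grow linearly with the hidden size while attention and FFN parameters grow quadratically, so the LayerNorm share of the block shrinks as models get wider. This is only an accounting of trainable parameters and says nothing about the accuracy results reported in the paper.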
