Parameter-Efficient Fine-Tuning of LLaMA for the Clinical Domain
Aryo Pradipta Gema, Pasquale Minervini, Luke Daines, Tom Hope, Beatrice Alex
TL;DR
The paper addresses the high cost of domain adaptation for large language models in clinical settings by proposing a two-step parameter-efficient fine-tuning (PEFT) framework, which combines Clinical LLaMA-LoRA for domain adaptation with Downstream LLaMA-LoRA for task-specific fine-tuning. It demonstrates that a small, domain-focused PEFT adapter can achieve AUROC gains across multiple clinical downstream tasks, including large-scale multilabel diagnoses and procedures classification, while reducing training time and computational requirements. The study provides extensive empirical analysis comparing LoRA with other PEFT methods, showing that a trainable Clinical LLaMA-LoRA, especially when combined with Downstream LLaMA-LoRA, yields the best macro-averaged AUROC scores and can outperform some clinically trained LMs. Overall, the framework offers a practical, resource-efficient pathway to deploying clinical LLMs with strong predictive performance, while highlighting limitations related to data diversity and potential spurious correlations.
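The macro-averaged AUROC mentioned above can be illustrated with a small, dependency-free sketch. It is not the authors' evaluation code: the function names `auroc` and `macro_auroc`, the toy data, and the choice to skip labels that contain only one class are all assumptions made for illustration.

```python
from typing import List, Optional


def auroc(labels: List[int], scores: List[float]) -> Optional[float]:
    """AUROC as the probability that a random positive example is scored
    above a random negative one (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None  # AUROC is undefined when only one class is present
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def macro_auroc(y_true: List[List[int]], y_score: List[List[float]]) -> float:
    """Unweighted mean of per-label AUROC for a multilabel task.
    Assumption: labels with a single class in y_true are skipped."""
    num_labels = len(y_true[0])
    per_label = []
    for j in range(num_labels):
        value = auroc([row[j] for row in y_true], [row[j] for row in y_score])
        if value is not None:
            per_label.append(value)
    if not per_label:
        raise ValueError("no label has both positive and negative examples")
    return sum(per_label) / len(per_label)


# Toy multilabel example: 4 documents, 2 labels (hypothetical values).
y_true = [[1, 0], [0, 1], [1, 1], [0, 0]]
y_score = [[0.9, 0.2], [0.1, 0.8], [0.8, 0.3], [0.3, 0.4]]
print(macro_auroc(y_true, y_score))
```

Because every label contributes equally to the mean, rare diagnosis or procedure codes weigh as much as frequent ones, which is why macro-averaging is a common choice for large multilabel settings.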
Abstract
Adapting pretrained language models to novel domains, such as clinical applications, traditionally involves retraining their entire set of parameters. Parameter-Efficient Fine-Tuning (PEFT) techniques significantly reduce computational requirements by selectively fine-tuning small subsets of parameters. In this study, we propose a two-step PEFT framework and evaluate it in the clinical domain. Our approach combines a specialised PEFT adapter layer designed for clinical domain adaptation with another adapter specialised for downstream tasks. We evaluate the framework on multiple clinical outcome prediction datasets, comparing it to clinically trained language models. Our framework achieves a better AUROC score averaged across all clinical downstream tasks compared to clinical language models. In particular, we observe large improvements of 4-5% AUROC in large-scale multilabel classification tasks, such as diagnoses and procedures classification. To our knowledge, this study is the first to provide an extensive empirical analysis of the interplay between PEFT techniques and domain adaptation in the important real-world domain of clinical applications.
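As a rough illustration of how two adapters can sit on top of one frozen model, the sketch below applies two LoRA-style low-rank updates to a single weight matrix using the standard LoRA form W + (alpha / r) * B A. It assumes the two adapters are simply summed onto the frozen weight; the abstract does not specify how the clinical and downstream adapters are composed, and all names (`lora_delta`, `effective_weight`, `trainable_params`) and numbers are hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x k) and b (k x m)."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def lora_delta(B: Matrix, A: Matrix, alpha: float) -> Matrix:
    """Low-rank update (alpha / r) * B A, where r is the adapter rank."""
    rank = len(A)
    return [[(alpha / rank) * x for x in row] for row in matmul(B, A)]


def effective_weight(W: Matrix, adapters: List[Tuple[Matrix, Matrix, float]]) -> Matrix:
    """Frozen weight plus the sum of all adapter updates (assumed composition)."""
    out = [row[:] for row in W]  # copy so the frozen weight is never modified
    for B, A, alpha in adapters:
        delta = lora_delta(B, A, alpha)
        out = [[o + d for o, d in zip(o_row, d_row)]
               for o_row, d_row in zip(out, delta)]
    return out


def trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in one adapter: B is d_out x r, A is r x d_in."""
    return rank * (d_out + d_in)


# Frozen 2 x 3 base weight and two rank-1 adapters (toy values).
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
clinical = ([[1.0], [0.0]], [[0.0, 0.0, 2.0]], 1.0)    # step 1: domain adaptation
downstream = ([[0.0], [1.0]], [[1.0, 0.0, 0.0]], 2.0)  # step 2: task fine-tuning
print(effective_weight(W, [clinical, downstream]))

# Cost of one rank-8 adapter relative to a full 4096 x 4096 weight matrix.
print(trainable_params(4096, 4096, 8) / (4096 * 4096))
```

The second print shows why this kind of adaptation is cheap: a rank-8 adapter on a 4096 x 4096 matrix trains well under 1% of that matrix's parameters, while the base weight stays frozen throughout both steps.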
