Efficiency at Scale: Investigating the Performance of Diminutive Language Models in Clinical Tasks

Niall Taylor; Upamanyu Ghose; Omid Rohanian; Mohammadmahdi Nouriborji; Andrey Kormilitzin; David Clifton; Alejo Nevado-Holgado

TL;DR

The paper asks how to achieve efficient clinical NLP with minimal computation by evaluating parameter-efficient fine-tuning (PEFT) methods across a range of model sizes, including very small language models. It systematically compares LoRA and $IA^3$, showing that LoRA delivers robust performance close to that of full fine-tuning across tasks and domains, while domain-specific pre-training (biomedical or clinical) improves both efficiency and accuracy, especially for smaller models. Experiments on the MIMIC-III and I2B2 datasets show that model size, PEFT choice, and pre-training domain interact to shape cost, training time, and performance: larger models offer gains but at steep resource costs, whereas compact models with LoRA achieve strong efficiency-performance trade-offs suitable for in-house deployment. The findings suggest prioritizing LoRA-based PEFT and domain-specific pre-training for practical, cost-effective clinical AI systems, with larger LLMs reserved for scenarios where maximum performance justifies the expense.

Abstract

The entry of large language models (LLMs) into research and commercial spaces has led to a trend of ever-larger models, with initial promises of generalisability, followed by a widespread desire to downsize and create specialised models without the need for complete fine-tuning, using Parameter Efficient Fine-tuning (PEFT) methods. We present an investigation into the suitability of different PEFT methods for clinical decision-making tasks, across a range of model sizes, including extremely small models with as few as $25$ million parameters. Our analysis shows that the performance of most PEFT approaches varies significantly from one task to another, with the exception of LoRA, which maintains relatively high performance across all model sizes and tasks, typically approaching or matching full fine-tuned performance. The effectiveness of PEFT methods in the clinical domain is evident, particularly for specialised models which can operate on low-cost, in-house computing infrastructure. The advantages of these models, in terms of speed and reduced training costs, dramatically outweigh any performance gain from large foundation LLMs. Furthermore, we highlight how domain-specific pre-training interacts with PEFT methods and model size, and discuss how these factors interplay to provide the best efficiency-performance trade-off. Full code available at: tbd.

Paper Structure (42 sections, 3 equations, 5 figures, 6 tables)

Introduction
Scales of LLM
Fine-tuning and PEFT
Clinical domain - LLM adaptation
Related work
Methods
Model architectures
Domain pre-training
Downstream fine-tuning
PEFT
Low-Rank Adaptation of Large Language Models
$IA^3$
Few-Shot training
Datasets and Tasks
Sequence classification tasks
...and 27 more sections

Figures (5)

Figure 1: Sequence classification performance across the different LLM model sizes and the associated number of trainable parameters.
Figure 2: Comparison of F1 micro scores on the I2B2 2010 relation extraction task dependent on whether the model received biomedical, clinical, or general domain pre-training.
Figure 3: Effect of training time (a) and few-shot sampling (b) on models of varying sizes, trained using full fine-tuning as well as LoRA. The connected points reflect the LoRA results to highlight the trend. The task used for this experiment was MIMIC mortality prediction.
Figure 4: Comparison of efficiency against performance on the validation set between models of different size.
Figure 5: Differential effect of LoRA rank on performance of a model. The y-axis represents the difference in AUROC between the rank on the x-axis and rank=8.
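
The rank dependence in Figure 5 can be related to the number of trainable parameters. The following is a minimal sketch, assuming the standard LoRA formulation in which a frozen weight matrix of shape d_out x d_in receives a trainable low-rank update B A, with B of shape d_out x r and A of shape r x d_in. The 768 x 768 layer size and the function names are hypothetical choices for illustration and are not taken from the paper; rank 8 is the reference rank named in the Figure 5 caption.

```python
def full_finetune_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when the whole weight matrix is updated."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters for a LoRA update B @ A on one weight matrix.

    B has shape (d_out, rank) and A has shape (rank, d_in);
    the original weight matrix stays frozen.
    """
    return rank * (d_out + d_in)


if __name__ == "__main__":
    d_out, d_in = 768, 768  # hypothetical layer size, for illustration only
    full = full_finetune_params(d_out, d_in)
    for rank in (2, 8, 32):
        lora = lora_params(d_out, d_in, rank)
        print(f"rank={rank}: {lora} trainable vs {full} full ({lora / full:.2%})")
```

Under these assumptions the trainable parameter count grows linearly with the rank, so doubling the rank doubles the adapter size for a given layer while the frozen weights are unchanged.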
