Developing Healthcare Language Model Embedding Spaces

Niall Taylor; Dan Schofield; Andrey Kormilitzin; Dan W Joyce; Alejo Nevado-Holgado

TL;DR

This work addresses the challenge of deploying effective language models in healthcare by pre-training smaller LLMs to produce document-level healthcare embeddings. It systematically compares three pre-training objectives (continued MLM, DeCLUTR-style contrastive learning, and a note-category metadata objective) across three healthcare datasets: MIMIC-III and two from the UK NHS (OHFT and PSIR). Both downstream classification performance and embedding-space structure are evaluated. Contrastive objectives, particularly when models are frozen, provide the strongest performance with limited labeled data, while note-category pre-training mainly improves clustering separability rather than classification accuracy. The findings underscore the value of domain-adapted, resource-efficient embedding strategies for privacy-conscious healthcare settings and offer practical guidelines for pre-training healthcare LLMs with an emphasis on contrastive objectives and embedding analysis.

Abstract

Pre-trained Large Language Models (LLMs) often struggle on out-of-domain datasets such as healthcare-focused text. We explore specialized pre-training to adapt smaller LLMs to different healthcare datasets. Three methods are assessed: traditional masked language modeling, Deep Contrastive Learning for Unsupervised Textual Representations (DeCLUTR), and a novel pre-training objective utilizing metadata categories from the healthcare settings. These schemes are evaluated on downstream document classification tasks for each dataset, with additional analysis of the resultant embedding spaces. Contrastively trained models outperform other approaches on the classification tasks, delivering strong performance from limited labeled data and with fewer model parameter updates required. While metadata-based pre-training does not further improve classification across the datasets, it does improve cluster separability in the embedding space. All domain-adapted LLMs outperform their publicly available general base LLM, validating the importance of domain specialization. This research illustrates efficient approaches to instill healthcare competency in compact LLMs even under tight computational budgets, an essential capability for responsible and sustainable deployment in local healthcare settings. We provide pre-training guidelines for specialized healthcare LLMs, motivate continued inquiry into contrastive objectives, and demonstrate adaptation techniques to align small LLMs with privacy-sensitive medical tasks.

Paper Structure (59 sections, 9 equations, 7 figures, 16 tables)

Introduction
Document-level label-free embeddings
The Healthcare Text Domain
The UK NHS
Motivation and Related Work
Methods
Datasets and downstream tasks
MIMIC-III
Oxford Health Foundation Trust - OHFT
NHS Patient Safety Incident Reports - PSIR
Note Category - All Datasets
Data splits
Language modelling - Preliminaries
Continued Masked Language Modelling
Contrastive Loss Pre-training
...and 44 more sections

Figures (7)

Figure 1: Adapted from Giorgi et al. (2021). Overview of the DeCLUTR training process. We sample anchor spans $s_i$ and positive spans $s_j$ from each document $d$ in a minibatch of size $N$. For simplicity, we show $A=P=1$, where $A$ and $P$ are the number of anchors and positives per document. The spans are encoded by $f(\cdot)$ and pooled by $g(\cdot)$ to get embeddings $e_i = g(f(s_i))$ and $e_j = g(f(s_j))$. The encoder and pooler are trained to minimize the distance between positive span pairs while maximizing the distance to negatives (omitted for simplicity).
Figure 2: Overview of our note category pre-training approach. The left side (A) shows the flow of the input sequence ($\textbf{x}$) through the standard MLM pipeline, and the right side (B) shows the integration of the associated note category label in parallel. The MLM and note category classification objectives are jointly optimised for each document.
Figure 3: F1 macro score on evaluation set for the MIMIC-III ICD-9 Triage task with frozen LLMs trained with different sample sizes per class.
Figure 4: F1 macro score on evaluation set for the MIMIC-III ICD-9 Triage task with varying transformer layers frozen (all models utilised a 12-layer RoBERTa architecture).
Figure 5: Cosine similarity of document embeddings within and between classes for the MIMIC-III ICD-9 triage dataset. Note that the y-axis scales are separate for each subplot because the value ranges differ greatly between models.
...and 2 more figures
