Large Language Models for Medical Forecasting -- Foresight 2

Zeljko Kraljevic, Joshua Au Yeung, Daniel Bean, James Teo, Richard J. Dobson

TL;DR

FS2 fine-tunes open LLMs on hospital EHR data to model contextualised patient timelines and predict SNOMED-coded biomedical concepts from clinical notes, improving next-concept prediction and risk forecasting. Built on LLaMAv2-7B and Mistralv0.1-7B, FS2 extends the tokenizer with SNOMED concepts and uses a timeline-based supervision scheme, yielding large gains over FS1 and outperforming GPT-4-turbo on risk tasks. The study shows that fine-tuning on high-quality, specialised data with contextualised timelines can let smaller models surpass larger ones on real-world medical prediction tasks, while acknowledging limitations in ontology coverage, NER quality, and deployment readiness. It highlights practical uses in high-precision alerts and risk stratification, and calls for broader datasets and alignment work to advance clinical applicability.

Abstract

Foresight 2 (FS2) is a large language model fine-tuned on hospital data for modelling patient timelines (GitHub 'removed for anon'). It can understand patients' clinical notes and predict SNOMED codes for a wide range of biomedical use cases, including diagnosis suggestions, risk forecasting, and procedure and medication recommendations. FS2 is trained on the free text portion of the MIMIC-III dataset, firstly through extracting biomedical concepts and then creating contextualised patient timelines, upon which the model is then fine-tuned. The results show significant improvement over the previous state-of-the-art for the next new biomedical concept prediction (P/R - 0.73/0.66 vs 0.52/0.32) and a similar improvement specifically for the next new disorder prediction (P/R - 0.69/0.62 vs 0.46/0.25). Finally, on the task of risk forecast, we compare our model to GPT-4-turbo (and a range of open-source biomedical LLMs) and show that FS2 performs significantly better on such tasks (P@5 - 0.90 vs 0.65). This highlights the need to incorporate hospital data into LLMs and shows that small models outperform much larger ones when fine-tuned on high-quality, specialised data.
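
The P@5 figure quoted in the abstract is a precision-at-k style metric. As a minimal sketch, assuming the common definition (the fraction of the top k ranked predictions that turn out to be correct), it could be computed as below. The paper's exact evaluation protocol is not given in this section, so the function name `precision_at_k` and the concept labels are illustrative assumptions, not the authors' code.

```python
from typing import Sequence, Set


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 5) -> float:
    """Fraction of the top-k ranked predictions that appear in `relevant`.

    Assumption: the standard definition, dividing by k even when fewer
    than k predictions are supplied.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = list(ranked)[:k]
    hits = sum(1 for concept in top_k if concept in relevant)
    return hits / k


# Hypothetical example: five ranked risk predictions for one patient,
# scored against the concepts that later appeared in the record.
ranked_predictions = ["concept_a", "concept_b", "concept_c", "concept_d", "concept_e"]
observed_later = {"concept_a", "concept_c", "concept_e", "concept_z"}
score = precision_at_k(ranked_predictions, observed_later, k=5)
```

Here three of the five top-ranked predictions are in the observed set. Averaging such per-patient scores over a test set is one plausible way to arrive at a single reported value.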

Paper Structure

This paper contains 20 sections, 1 equation, 1 figure, 4 tables.

Figures (1)

  • Figure 1: Data preparation workflow: 1) We collect all free text documents from the patient EHR; 2) Extract mentions of SNOMED-CT concepts and combine the concepts with static data like sex, ethnicity and age; 3) Clean, filter and bucket the concepts and turn them into a patient timeline; and lastly 4) From the concepts in the timeline, based on the context where each one was found, reconstruct a singular clinical note for each patient.
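
Steps 3 and 4 of the workflow in Figure 1 can be illustrated with a small sketch. It assumes that concept extraction (step 2) has already produced dated mentions with their surrounding context. The `Mention` type, the frequency filter, the one-day bucket size, and the note format are all hypothetical choices made for illustration; the caption does not specify how the authors clean, filter, or bucket concepts.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date
from typing import Dict, List


@dataclass(frozen=True)
class Mention:
    concept: str   # placeholder label standing in for a SNOMED-CT concept
    seen_on: date  # date of the document the mention came from
    context: str   # text snippet in which the concept was found


def build_timeline(mentions: List[Mention], min_count: int = 1) -> List[Mention]:
    """Filter rare concepts, bucket by day, and order chronologically."""
    counts = Counter(m.concept for m in mentions)
    kept = [m for m in mentions if counts[m.concept] >= min_count]
    seen = set()
    timeline = []
    # sorted() is stable, so same-day mentions keep their input order.
    for m in sorted(kept, key=lambda m: m.seen_on):
        bucket_key = (m.seen_on, m.concept)
        if bucket_key in seen:
            continue  # keep only the first mention per concept per day
        seen.add(bucket_key)
        timeline.append(m)
    return timeline


def reconstruct_note(static: Dict[str, str], timeline: List[Mention]) -> str:
    """Join static data and per-concept context into one note per patient."""
    header = ", ".join(f"{key}: {value}" for key, value in static.items())
    body = [f"{m.seen_on.isoformat()}: {m.context}" for m in timeline]
    return "\n".join([header] + body)


# Hypothetical patient data.
mentions = [
    Mention("fever", date(2020, 1, 2), "patient reports fever"),
    Mention("cough", date(2020, 1, 1), "dry cough noted"),
    Mention("fever", date(2020, 1, 2), "fever persists"),
    Mention("cough", date(2020, 1, 3), "cough improving"),
    Mention("rash", date(2020, 1, 1), "mild rash"),
]
timeline = build_timeline(mentions, min_count=2)
note = reconstruct_note({"sex": "F", "age": "64"}, timeline)
```

In this toy run the single "rash" mention is dropped by the frequency filter and the repeated same-day "fever" mention is collapsed, leaving three timeline entries under a header line of static data.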