Onco-Retriever: Generative Classifier for Retrieval of EHR Records in Oncology

Shashi Kant Gupta, Aditya Basu, Bradley Taylor, Anai Kothari, Hrituraj Singh

TL;DR

Electronic Health Records (EHRs) are largely unstructured, making retrieval and summarization of patient journeys difficult. The paper introduces Onco-Retriever, a compact, oncology-focused retriever trained via synthetic labeling from GPT-3.5 and distilled into three variants (Small, Optimized, Large), designed for local/on-premise deployment. It demonstrates superior precision and recall across 13 oncology concepts compared with Ada, Mistral, and PubMedBERT, with favorable latency for production use. The approach enables effective retrieval-augmented generation in oncology, emphasizes privacy-by-design deployment, and offers a scalable path toward domain-specific EHR search, while acknowledging limitations in generalizability and real-time use.

Abstract

Retrieving information from EHR systems is essential for answering specific questions about patient journeys and improving the delivery of clinical care. Despite this, most EHR systems still rely on keyword-based search. Generative large language models (LLMs) make better search and summarization possible, and retrievers built on them can also feed retrieval-augmented generation (RAG) pipelines to answer arbitrary queries. However, retrieving information from the real-world clinical data held in EHR systems to serve downstream use cases is challenging because query-document support pairs are difficult to create. We provide a blueprint for creating such datasets affordably using large language models. Our method results in a retriever that is 30-50 F1 points better than proprietary counterparts such as Ada and Mistral for oncology data elements. We further compare our model, called Onco-Retriever, against a fine-tuned PubMedBERT model. We conduct an extensive manual evaluation on real-world EHR data along with a latency analysis of the different models, and provide a path forward for healthcare organizations to build domain-specific retrievers.

Paper Structure (11 sections, 2 equations, 7 figures, 1 table)

Figures (7)

  • Figure 1: Patient notes are first filtered and chunked to remove unnecessary information. The merged chunks for all patients in the training set are then passed through a generic retriever using a manually curated set of expanded queries. The retrieved top-k chunks are evaluated one at a time using GPT-3.5, and the evaluation output is structured so that it forms a training dataset. The retriever model is then fine-tuned on this dataset to produce Onco-Retriever.
  • Figure 2: From our training set, we first use a state-of-the-art retriever to find the top 500 chunks. We then use GPT-3.5 to generate labels for each chunk. This yields multiple (chunk, concept) label pairs, which we treat as training instances.
  • Figure 3: Precision and Recall Results across multiple retrievers
  • Figure 4: Average time taken by each model to process all the documents pertaining to a single patient vs F1 Scores fo
  • Figure 5: Normalised F1 scores across concepts for each retriever
  • ...and 2 more figures
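
The data-generation flow described in the captions for Figures 1 and 2 (chunk the notes, retrieve candidate chunks per concept, label each chunk, collect the labels as training instances) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation. The concept names, the query terms in `EXPANDED_QUERIES`, the fixed-size word chunking, the term-overlap retriever, and the keyword rule in `label_chunk` are all hypothetical stand-ins so that the example runs offline. In the paper, a generic retriever and GPT-3.5 fill the retrieval and labeling roles. Applying the top-500 cutoff once per concept is also an assumption.

```python
from dataclasses import dataclass

# Hypothetical concept -> expanded query terms. The paper's manually curated
# query set is not reproduced here; these entries are illustrative only.
EXPANDED_QUERIES = {
    "biomarker": ["biomarker", "egfr", "her2", "mutation"],
    "staging": ["stage", "staging", "tnm", "metastatic"],
}


@dataclass(frozen=True)
class TrainingInstance:
    chunk: str
    concept: str
    label: bool  # True if the chunk is judged relevant to the concept


def chunk_note(note: str, max_words: int = 40) -> list[str]:
    """Split a note into fixed-size word windows (assumed chunking scheme)."""
    words = note.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def overlap_score(chunk: str, terms: list[str]) -> int:
    """Toy relevance score: how many query terms occur in the chunk."""
    text = chunk.lower()
    return sum(term in text for term in terms)


def retrieve_top_k(chunks: list[str], terms: list[str], k: int) -> list[str]:
    """Stand-in for the generic retriever: rank by the toy score, keep the top k."""
    return sorted(chunks, key=lambda c: overlap_score(c, terms), reverse=True)[:k]


def label_chunk(chunk: str, concept: str) -> bool:
    """Placeholder for the LLM judgment (GPT-3.5 in the paper).

    A keyword rule is used here only so the sketch runs without network access.
    """
    return overlap_score(chunk, EXPANDED_QUERIES[concept]) > 0


def build_training_set(notes: list[str], k: int = 500) -> list[TrainingInstance]:
    """Chunk all notes, retrieve top-k chunks per concept, and label each one."""
    chunks = [c for note in notes for c in chunk_note(note)]
    instances = []
    for concept, terms in EXPANDED_QUERIES.items():
        for chunk in retrieve_top_k(chunks, terms, k):
            instances.append(TrainingInstance(chunk, concept, label_chunk(chunk, concept)))
    return instances


if __name__ == "__main__":
    demo_notes = [
        "Patient has EGFR mutation detected.",
        "Follow-up visit, no complaints.",
    ]
    for instance in build_training_set(demo_notes, k=2):
        print(instance)
```

Each labeled (chunk, concept) pair becomes one training instance, which matches the framing in the Figure 2 caption. Fine-tuning a retriever on these instances is outside the scope of this sketch.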