PRISM: Patient Records Interpretation for Semantic Clinical Trial Matching using Large Language Models

Shashi Kant Gupta; Aditya Basu; Mauro Nievas; Jerrin Thomas; Nathan Wolfrath; Adhitya Ramamurthi; Bradley Taylor; Anai N. Kothari; Regina Schwind; Therica M. Miller; Sorena Nadaf-Rahrov; Yanshan Wang; Hrituraj Singh

TL;DR

PRISM tackles the real-world clinical trial matching problem by deploying an end-to-end pipeline that interprets unstructured EHR notes and trial criteria to rank eligible trials. It introduces a compositional QA framework with a scoring function $S=f\big(C(a_1,a_2,\ldots,a_j)\big)$ and, for each criterion, a probabilistic decision rule $\text{Criteria Met}=\begin{cases} \text{Yes}, & P(\text{criteria met}|\text{data})>0.66 \\ \text{No}, & P(\text{criteria met}|\text{data})<0.34 \\ \text{N/A}, & \text{otherwise} \end{cases}$, allowing robust handling of incomplete information. In an extensive real-world evaluation, the OncoLLM 14B model achieves competitive criterion-level accuracy (63% overall, 66% after excluding N/As) and superior ranking performance (top-3 hits 65.3% and NDCG 0.68) compared to GPT-3.5-Turbo, while offering dramatic cost savings (~$170 vs ~$6,055 for GPT-4). The work demonstrates both patient-centric and trial-centric search capabilities, supports privacy-preserving deployment on private infrastructure, and discusses practical considerations and future enhancements, such as integrating structured data and improving retrievers, to further improve reliability and deployment readiness.
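The three-way decision rule above can be illustrated with a minimal Python sketch. The function name `criterion_label` and its parameters are hypothetical and not taken from the paper; only the thresholds 0.66 and 0.34 come from the rule as stated. How the probability itself is estimated is not covered here.

```python
from typing import Literal

Label = Literal["Yes", "No", "N/A"]


def criterion_label(p_met: float, upper: float = 0.66, lower: float = 0.34) -> Label:
    """Map P(criteria met | data) to Yes / No / N/A using the stated thresholds.

    `p_met` is assumed to be a probability in [0, 1] produced elsewhere;
    this sketch only applies the thresholding step.
    """
    if not 0.0 <= p_met <= 1.0:
        raise ValueError("p_met must be a probability in [0, 1]")
    if p_met > upper:
        return "Yes"
    if p_met < lower:
        return "No"
    # Anything in [lower, upper] is treated as insufficient evidence.
    return "N/A"


if __name__ == "__main__":
    for p in (0.9, 0.5, 0.1):
        print(p, criterion_label(p))
```

Because both inequalities are strict, probabilities exactly at 0.66 or 0.34 fall into the N/A band, which matches the "otherwise" branch of the rule.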

Abstract

Clinical trial matching is the task of identifying trials for which patients may be potentially eligible. Typically, this task is labor-intensive and requires detailed verification of patient electronic health records (EHRs) against the stringent inclusion and exclusion criteria of clinical trials. This process is manual, time-intensive, and challenging to scale up, resulting in many patients missing out on potential therapeutic options. Recent advancements in Large Language Models (LLMs) have made automating patient-trial matching possible, as shown in multiple concurrent research studies. However, the current approaches are confined to constrained, often synthetic datasets that do not adequately mirror the complexities encountered in real-world medical data. In this study, we present the first end-to-end, large-scale empirical evaluation of clinical trial matching using real-world EHRs. Our study showcases the capability of LLMs to accurately match patients with appropriate clinical trials. We perform experiments with proprietary LLMs, including GPT-4 and GPT-3.5, as well as our custom fine-tuned model called OncoLLM, and show that OncoLLM, despite its significantly smaller size, not only outperforms GPT-3.5 but also matches the performance of qualified medical doctors. All experiments were carried out on real-world EHRs that include clinical notes and available clinical trials from a single cancer center in the United States.

Paper Structure (22 sections, 8 equations, 12 figures)

Introduction
Related Work
Clinical Trial Matching
Large Language Models
Results
Criteria/Question Level Accuracy
Ranking Scores
Patient-Centric Ranking
Trial-Centric Ranking
Error Analysis
Cost-Benefit Analysis
Methods
Problem Formulation
Dataset Preparation
PRISM Pipeline
...and 7 more sections

Figures (12)

Figure 1: The pipeline uses only unstructured notes to match patients to potential clinical trials. Patient notes are first filtered according to the defined rules and are then chunked using a contextual chunker. The chunks are then stored in a database. The trial criteria are ingested as plain text extracted from ClinicalTrials.gov and are converted into a graphical question representation as described in Section \ref{sec:methods}. This graph is then used to retrieve relevant snippets of information, and our proprietary fine-tuned language model calculates a score for the graph. We then apply weights to that graph using our developed heuristics, which allow the pipeline to rank the trials accurately.
Figure 2: (a) OncoLLM outperforms most of the prominent LLMs in criteria/question-level answering accuracy. The first column, All, shows question-level accuracy across the full 720-question Q&A dataset for oncology-related clinical trials. The second column, Without N/A samples, shows question-level accuracy after removing the questions that medical experts answered 'N/A'. * Human accuracy was obtained on only 109 questions, which were annotated by two medical experts. (b) OncoLLM (in red) performs consistently well across all the relevant oncology-related concepts.
Figure 3: Accuracy Comparison Based on Model Size and Number of "N/A" Outputs. This figure compares model accuracy with the frequency of "N/A" outputs. A higher frequency of "N/A" outputs indicates lower usefulness of the model. The size of each bubble represents the number of parameters of the model. This highlights how close OncoLLM's performance is to that of GPT-4 despite having relatively fewer parameters.
Figure 4: OncoLLM with the Weighted Tier scoring method performs best in both search directions. (a) OncoLLM (Weighted Tier) ranked the ground truth trial in the top 3 of the 10 considered trials 65.3% of the time, while GPT-3.5-Turbo (Iterative Tier) did so only 61.2% of the time. (b) OncoLLM (Weighted Tier) achieved an NDCG of 68%, compared to 62.6% for GPT-3.5-Turbo (Iterative Tier). See Section \ref{sec:scoring_module} for details on the scoring methods.
Figure 5: Criteria/Question Level Analysis on 98 Patient, Ground Truth Trial Pairs. A. Criteria level Met/Not-Met/NA stats for all the ground truth trials. B. Criteria level Met/Not-Met/NA stats where the ground truth trial ranked within the top-3. C. Question level N/A stats where the ground truth trial ranked within the top-3 (lower is better).
...and 7 more figures
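
The two ranking metrics quoted in the Figure 4 caption, top-3 hit rate and NDCG, can be illustrated with a short Python sketch. It assumes binary relevance with exactly one ground truth trial per patient, so the ideal DCG is 1; the paper's exact NDCG setup may differ. The function names and the example ranks are hypothetical, not data from the paper.

```python
import math
from typing import Sequence


def top_k_hit_rate(ranks: Sequence[int], k: int = 3) -> float:
    """Fraction of patients whose ground truth trial is ranked within the top k.

    `ranks` holds the 1-based rank of the ground truth trial for each patient.
    """
    return sum(1 for r in ranks if r <= k) / len(ranks)


def ndcg_single_relevant(rank: int) -> float:
    """NDCG for one ranked list with a single relevant item at 1-based `rank`.

    With binary relevance and one relevant item, DCG = 1 / log2(rank + 1)
    and the ideal DCG (relevant item at rank 1) equals 1.
    """
    return 1.0 / math.log2(rank + 1)


def mean_ndcg(ranks: Sequence[int]) -> float:
    """Average the per-patient NDCG over all patients."""
    return sum(ndcg_single_relevant(r) for r in ranks) / len(ranks)


if __name__ == "__main__":
    example_ranks = [1, 2, 4, 10]  # illustrative only
    print(top_k_hit_rate(example_ranks, k=3))
    print(mean_ndcg(example_ranks))
```

Under these assumptions, a ground truth trial at rank 1 contributes an NDCG of 1.0 and one at rank 3 contributes 0.5, so NDCG rewards placing the correct trial near the top more finely than a top-3 cutoff does.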
