Table of Contents
Fetching ...

FHIRPath-QA: Executable Question Answering over FHIR Electronic Health Records

Michael Frew, Nishit Bheda, Bryan Tripp

TL;DR

FHIRPath-QA is introduced, the first open dataset and benchmark for patient-specific QA that includes open-standard FHIRPath queries over real-world clinical data and it is demonstrated that state-of-the-art LLMs struggle to deal with ambiguity in patient language and perform poorly in FHIRPath query synthesis.

Abstract

Though patients are increasingly granted digital access to their electronic health records (EHRs), existing interfaces may not support precise, trustworthy answers to patient-specific questions. Large language models (LLM) show promise in clinical question answering (QA), but retrieval-based approaches are computationally inefficient, prone to hallucination, and difficult to deploy over real-life EHRs. In this work, we introduce FHIRPath-QA, the first open dataset and benchmark for patient-specific QA that includes open-standard FHIRPath queries over real-world clinical data. We propose a text-to-FHIRPath QA paradigm that shifts reasoning from free-text generation to FHIRPath query synthesis, significantly reducing LLM usage. Built on MIMIC-IV on FHIR Demo, the dataset pairs over 14k natural language questions in patient and clinician phrasing with validated FHIRPath queries and answers. Further, we demonstrate that state-of-the-art LLMs struggle to deal with ambiguity in patient language and perform poorly in FHIRPath query synthesis. However, they benefit strongly from supervised fine-tuning. Our results highlight that text-to-FHIRPath synthesis has the potential to serve as a practical foundation for safe, efficient, and interoperable consumer health applications, and our dataset and benchmark serve as a starting point for future research on the topic. The full dataset and generation code is available at: https://github.com/mooshifrew/fhirpath-qa.

FHIRPath-QA: Executable Question Answering over FHIR Electronic Health Records

TL;DR

FHIRPath-QA is introduced, the first open dataset and benchmark for patient-specific QA that includes open-standard FHIRPath queries over real-world clinical data and it is demonstrated that state-of-the-art LLMs struggle to deal with ambiguity in patient language and perform poorly in FHIRPath query synthesis.

Abstract

Though patients are increasingly granted digital access to their electronic health records (EHRs), existing interfaces may not support precise, trustworthy answers to patient-specific questions. Large language models (LLM) show promise in clinical question answering (QA), but retrieval-based approaches are computationally inefficient, prone to hallucination, and difficult to deploy over real-life EHRs. In this work, we introduce FHIRPath-QA, the first open dataset and benchmark for patient-specific QA that includes open-standard FHIRPath queries over real-world clinical data. We propose a text-to-FHIRPath QA paradigm that shifts reasoning from free-text generation to FHIRPath query synthesis, significantly reducing LLM usage. Built on MIMIC-IV on FHIR Demo, the dataset pairs over 14k natural language questions in patient and clinician phrasing with validated FHIRPath queries and answers. Further, we demonstrate that state-of-the-art LLMs struggle to deal with ambiguity in patient language and perform poorly in FHIRPath query synthesis. However, they benefit strongly from supervised fine-tuning. Our results highlight that text-to-FHIRPath synthesis has the potential to serve as a practical foundation for safe, efficient, and interoperable consumer health applications, and our dataset and benchmark serve as a starting point for future research on the topic. The full dataset and generation code is available at: https://github.com/mooshifrew/fhirpath-qa.
Paper Structure (28 sections, 3 figures, 4 tables)

This paper contains 28 sections, 3 figures, 4 tables.

Figures (3)

  • Figure 1: FHIRPath-QA construction process.
  • Figure 2: The query generated to answer the question Were any organisms found in my last SEROLOGY/BLOOD microbiology test this year? Colored regions indicate logical stages of the query: resource type selection (blue), specimen filtering (green), temporal constraints (orange), and selection of the most recent observation (purple). A final post-aggregation constraint checks that an organism was found (grey).
  • Figure 3: Answer accuracy of base and fine-tuned models for training sets containing varying numbers of patients (10, 30, 100) for the query-first generation pipeline.