Overview of the EHRSQL 2024 Shared Task on Reliable Text-to-SQL Modeling on Electronic Health Records
Gyubok Lee, Sunjun Kweon, Seongsu Bae, Edward Choi
TL;DR
This paper introduces the EHRSQL 2024 shared task, which targets reliable text-to-SQL modeling on EHRs: models must generate correct SQL for answerable questions and abstain on unanswerable ones, with RS(10) as the main evaluation metric. The dataset combines real-world clinician templates, LLM-generated paraphrases, and adversarial unanswerable questions, and it uses a new data split containing both seen and unseen templates to simulate distribution shift. Eight teams pursued unified or pipeline-based approaches; unified methods generally outperformed pipelines, and the top results reached RS(10) ≈ 81.32. This represents substantial progress toward reliable clinical QA, although RS(N) remains challenging. Effective strategies included self-training with pseudo-labels, ensemble prompting with abstention, synthetic data generation, and abstention-aware generation. Future directions include making progress on RS(N) and extending reliable QA to multimodal EHR data.
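To make the role of abstention in RS(10) concrete, the following is a minimal sketch of a penalty-based reliability score. The scoring rule is an assumption, since it is not spelled out in this summary: a correct SQL query on an answerable question earns +1, abstaining on an answerable question earns 0, abstaining on an unanswerable question earns +1, and any other answered case (a wrong query, or a query for an unanswerable question) costs the penalty. The function name `reliability_score`, the record layout, and the scaling by 100 are hypothetical choices made for illustration.

```python
from typing import List, Optional, Tuple

# One evaluation record (hypothetical layout):
#   answerable: whether the question can be answered from the database
#   predicted:  the generated SQL string, or None if the model abstained
#   correct:    whether the generated SQL yields the right answer
Record = Tuple[bool, Optional[str], bool]


def reliability_score(records: List[Record], penalty: float = 10.0) -> float:
    """Mean per-question score, scaled by 100 (assumed RS(penalty) rule)."""
    if not records:
        raise ValueError("records must not be empty")

    total = 0.0
    for answerable, predicted, correct in records:
        abstained = predicted is None
        if answerable:
            if abstained:
                total += 0.0       # safe, but no credit
            elif correct:
                total += 1.0       # correct SQL for an answerable question
            else:
                total -= penalty   # wrong SQL is penalized
        else:
            if abstained:
                total += 1.0       # correctly refused an unanswerable question
            else:
                total -= penalty   # answered something that should be refused
    return 100.0 * total / len(records)


if __name__ == "__main__":
    demo = [
        (True, "SELECT 1", True),    # correct answer
        (True, None, False),         # abstained on an answerable question
        (True, "SELECT 2", False),   # wrong answer
        (False, None, False),        # correct abstention
    ]
    print(reliability_score(demo, penalty=10.0))
    print(reliability_score(demo, penalty=0.0))
```

Under this assumed rule, a single wrong answer at penalty 10 cancels ten correct ones, which is why abstention-aware methods matter. Setting the penalty to the number of questions would correspond to RS(N), where one mistake outweighs every correct prediction.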
Abstract
Electronic Health Records (EHRs) are relational databases that store the entire medical histories of patients within hospitals. They record numerous aspects of patients' medical care, from hospital admission and diagnosis to treatment and discharge. While EHRs are vital sources of clinical data, exploring them beyond a predefined set of queries requires skills in query languages like SQL. To make information retrieval more accessible, one strategy is to build a question-answering system, possibly leveraging text-to-SQL models that automatically translate natural language questions into corresponding SQL queries and use these queries to retrieve the answers. The EHRSQL 2024 shared task aims to advance and promote research on question-answering systems for EHRs based on text-to-SQL modeling, systems that can reliably provide requested answers to various healthcare professionals, improving their clinical work processes and meeting their needs. Among more than 100 participants who applied to the shared task, eight teams were formed and completed all of the shared task requirements, demonstrating a wide range of methods for solving the task. In this paper, we describe the task of reliable text-to-SQL modeling, the dataset, and the methods and results of the participants. We hope this shared task will spur further research and insights into developing reliable question-answering systems for EHRs.
