EHR-SeqSQL: A Sequential Text-to-SQL Dataset For Interactively Exploring Electronic Health Records

Jaehee Ryu, Seonhee Cho, Gyubok Lee, Edward Choi

TL;DR

EHR-SeqSQL introduces a sequential, context-aware text-to-SQL benchmark for electronic health records that targets interactivity, compositional generalization, and efficiency. The dataset converts the existing single-turn EHRSQL data into multi-turn interactions through SQL decomposition and natural language question (NLQ) generation, and adds a compositional split and special SQL tokens intended to speed up execution. Empirical results show that multi-turn training improves compositional generalization and the handling of longer interactions, and that the new SQL tokens reduce execution time and can improve model performance, especially for smaller models. The work provides a public resource that connects practical hospital data exploration needs with text-to-SQL research.

Abstract

In this paper, we introduce EHR-SeqSQL, a novel sequential text-to-SQL dataset for Electronic Health Record (EHR) databases. EHR-SeqSQL is designed to address critical yet underexplored aspects in text-to-SQL parsing: interactivity, compositionality, and efficiency. To the best of our knowledge, EHR-SeqSQL is not only the largest but also the first medical text-to-SQL benchmark to include sequential and contextual questions. We provide a data split and a new test set designed to assess compositional generalization ability. Our experiments demonstrate the superiority of a multi-turn approach over a single-turn approach in learning compositionality. Additionally, our dataset integrates specially crafted tokens into SQL queries to improve execution efficiency. With EHR-SeqSQL, we aim to bridge the gap between practical needs and academic research in the text-to-SQL domain. EHR-SeqSQL is available at https://github.com/seonhee99/EHR-SeqSQL.
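
To make the idea of context-referring tokens concrete, the following is a minimal sketch of one plausible post-processing step: a placeholder in the current turn's SQL is textually replaced by the previous turn's query before execution. The token name `<prev_query>`, the helper functions, and the toy `admissions` table are all assumptions made for illustration; they are not taken from the dataset, and the authors' actual tokens and post-processing may differ.

```python
import sqlite3
from typing import List, Optional

# Hypothetical placeholder token; the dataset's real token names may differ.
PREV_QUERY_TOKEN = "<prev_query>"


def resolve_turn(sql_with_token: str, previous_sql: Optional[str]) -> str:
    """Substitute the placeholder with the previous turn's SQL as a subquery."""
    if PREV_QUERY_TOKEN not in sql_with_token:
        return sql_with_token
    if previous_sql is None:
        raise ValueError("placeholder used in the first turn of an interaction")
    return sql_with_token.replace(PREV_QUERY_TOKEN, f"({previous_sql})")


def run_interaction(conn: sqlite3.Connection, turns: List[str]) -> List[list]:
    """Resolve and execute each turn in order, carrying the resolved SQL forward."""
    results = []
    previous_sql = None
    for turn_sql in turns:
        executable = resolve_turn(turn_sql, previous_sql)
        results.append(conn.execute(executable).fetchall())
        previous_sql = executable
    return results


# Toy in-memory table standing in for an EHR database (illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE admissions (subject_id INTEGER, age INTEGER)")
conn.executemany(
    "INSERT INTO admissions VALUES (?, ?)", [(1, 45), (2, 67), (3, 72)]
)

turns = [
    "SELECT subject_id FROM admissions WHERE age >= 60",
    "SELECT COUNT(*) FROM admissions WHERE subject_id IN <prev_query>",
]
results = run_interaction(conn, turns)
```

In this sketch the second turn stays short because it refers to the first turn instead of repeating its conditions. Plain substitution re-runs the earlier query as a subquery, so it illustrates only the referencing mechanism, not the execution-time savings reported for the dataset's tokens.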

Paper Structure (52 sections, 7 figures, 12 tables)

Figures (7)

  • Figure 1: EHRSQL vs. EHR-SeqSQL. EHR-SeqSQL is a dataset that adapts the single-turn setting of EHRSQL into a multi-turn setting. The SQL queries in EHR-SeqSQL include special tokens that refer to the previous context; these queries can be executed on the database after simple post-processing.
  • Figure 2: Overview of the dataset construction process. We transform EHRSQL's single text-SQL pairs into multi-turn pairs for EHR-SeqSQL by first breaking down the original SQL into subqueries (Stages 1 and 2), then merging common patterns with the BPE algorithm (Stage 3). Natural language questions (NLQs) are created for each subquery using templates and paraphrased for clarity using ChatGPT.
  • Figure 3: Prompt for paraphrasing.
  • Figure 4: Related interaction goals and their context graph.
  • Figure 5: Example of Longer Interaction.
  • ...and 2 more figures