LG AI Research & KAIST at EHRSQL 2024: Self-Training Large Language Models with Pseudo-Labeled Unanswerable Questions for a Reliable Text-to-SQL System on EHRs

Yongrae Jo, Seongyun Lee, Minju Seo, Sung Ju Hwang, Moontae Lee

TL;DR

This work addresses the need for reliable text-to-SQL in healthcare: clinicians should be able to query EHRs without SQL expertise, and the system should correctly identify unanswerable questions to avoid misinformation. The authors introduce PLUQ, a two-stage self-training framework that augments the training data with pseudo-labeled unanswerable questions and applies post-hoc filtering based on token entropy and query execution against the MIMIC-IV demo database. Through training with a seed model, pseudo-labeling, and careful filtering, PLUQ achieves top performance on the EHRSQL 2024 shared task, with particular attention to optimizing the Reliability Score RS(10). The approach advances reliable, interpretable access to EHR data and demonstrates practical potential for safer clinical decision support, while also highlighting limitations in generalization and the need for further refinement of reliability metrics.

Abstract

Text-to-SQL models are pivotal for making Electronic Health Records (EHRs) accessible to healthcare professionals without SQL knowledge. With the advancements in large language models, these systems have become more adept at translating complex questions into SQL queries. Nonetheless, the critical need for reliability in healthcare necessitates these models to accurately identify unanswerable questions or uncertain predictions, preventing misinformation. To address this problem, we present a self-training strategy using pseudo-labeled unanswerable questions to enhance the reliability of text-to-SQL models for EHRs. This approach includes a two-stage training process followed by a filtering method based on the token entropy and query execution. Our methodology's effectiveness is validated by our top performance in the EHRSQL 2024 shared task, showcasing the potential to improve healthcare decision-making through more reliable text-to-SQL systems.
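
The filtering step named in the abstract (token entropy plus query execution) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the choice of the maximum per-token entropy as the uncertainty signal, the threshold value, and the use of Python's `sqlite3` module with a toy table standing in for the MIMIC-IV demo database are all assumptions made for illustration.

```python
import math
import sqlite3

NULL = "null"  # label used for abstaining (unanswerable or uncertain)


def max_token_entropy(token_distributions):
    """Largest Shannon entropy (in nats) over the per-token distributions.

    Assumption: each item is a list of probabilities for one generated token.
    """
    entropies = [
        -sum(p * math.log(p) for p in probs if p > 0.0)
        for probs in token_distributions
    ]
    return max(entropies, default=0.0)


def filter_prediction(sql, token_distributions, conn, entropy_threshold=1.0):
    """Return the SQL string if it passes both filters, otherwise 'null'.

    The threshold of 1.0 is a hypothetical value, not one from the paper.
    """
    if sql.strip().lower() == NULL:
        return NULL
    # Filter 1: abstain when the model was too uncertain about any token.
    if max_token_entropy(token_distributions) > entropy_threshold:
        return NULL
    # Filter 2: abstain when the query cannot be executed on the database.
    try:
        conn.execute(sql).fetchall()
    except sqlite3.Error:
        return NULL
    return sql
```

A prediction therefore survives only if the model was confident on every token and the database accepts the query; everything else is mapped to `null`, which trades some coverage for fewer incorrect answers.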

Paper Structure (26 sections, 1 equation, 3 figures, 4 tables)

Figures (3)

  • Figure 1: Training Process and SQL Query Generation. The model is initially trained using the training set. Then, a SQL query (or null) is generated for each sample in the test set using the trained model. Subsequently, we select $K$ null samples and add them to the training set, resulting in a null-augmented training set. This augmented dataset is then used to train the final model, denoted as $M_{t+1}$.
  • Figure 2: Formal definition of RS for a single data instance. $Q_{\text{una}}$ denotes an unanswerable question and $Q_{\text{ans}}$ an answerable question. $g(x)=1$ means that the model generates a SQL query, and $g(x)=0$ means that the model generates 'null'. $Acc(x) = 1$ signifies that the model's prediction is correct, while $Acc(x) = 0$ indicates that it is incorrect. $c$ represents the penalty.
  • Figure 3: The system prompt and the user prompt template used in PLUQ. The prompt integrates instructions for handling unanswerable questions and the MIMIC-IV database schema.
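
The symbols in the Figure 2 caption can be made concrete with a small sketch of the per-instance score. The caption names the symbols but does not list the score values, so the piecewise values below follow the Reliability Score as it is commonly stated for the EHRSQL benchmark and should be read as an assumption: +1 for a correct SQL answer or a correct abstention, 0 for abstaining on an answerable question, and $-c$ for an incorrect SQL query or for answering an unanswerable question. The function and argument names are hypothetical, and the result is returned as a plain mean rather than a percentage.

```python
def instance_score(answerable, generated_sql, correct, penalty):
    """Score one instance.

    answerable    -- True if x is in Q_ans, False if x is in Q_una
    generated_sql -- True if g(x) = 1 (SQL produced), False if g(x) = 0 ('null')
    correct       -- True if Acc(x) = 1; only consulted for answered,
                     answerable questions
    penalty       -- the penalty c
    """
    if answerable:
        if not generated_sql:
            return 0.0  # abstained on an answerable question
        return 1.0 if correct else float(-penalty)
    # Unanswerable question: abstaining is rewarded, answering is penalized.
    return float(-penalty) if generated_sql else 1.0


def reliability_score(instances, penalty):
    """Mean per-instance score over (answerable, generated_sql, correct) tuples."""
    instances = list(instances)
    if not instances:
        raise ValueError("at least one instance is required")
    total = sum(instance_score(a, g, ok, penalty) for a, g, ok in instances)
    return total / len(instances)
```

Under these assumed values, a larger $c$ makes a single wrong answer outweigh many correct ones, which is why abstaining on uncertain predictions pays off when RS(10) is the target.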