ProbGate at EHRSQL 2024: Enhancing SQL Query Generation Accuracy through Probabilistic Threshold Filtering and Error Handling

Sangryul Kim, Donghee Han, Sehyun Kim

TL;DR

The paper tackles reliable Text-to-SQL generation in electronic health records by addressing unanswerable queries. It introduces ProbGate, a two-stage pipeline that uses token-level log probabilities to filter uncertain outputs and an execution-based grammatical-error filter to ensure syntactic validity. Fine-tuned GPT-3.5-turbo models, combined with careful prompt design and evaluated with the RS reliability metric, achieve strong performance on the EHRSQL-2024 dataset, outperforming binary classifiers and other baselines. The approach is practical for high-stakes medical settings, though the paper acknowledges limitations related to data distribution and reliance on closed-source models, and points toward open-source LLMs and distribution-aware filtering as future directions.

Abstract

Recently, deep learning-based language models have significantly improved text-to-SQL tasks, with promising applications in retrieving patient records in the medical domain. One notable challenge in such applications is discerning unanswerable queries. By fine-tuning a model, we demonstrate the feasibility of converting medical record inquiries into SQL queries. Additionally, we introduce an entropy-based method to identify and filter out unanswerable results. We further enhance result quality by filtering low-confidence SQL based on the distribution of log probabilities, while grammatical and schema errors are mitigated by executing queries on the actual database. We experimentally verified that our method can filter unanswerable questions, that it remains applicable even when the parameters of the model are not accessible, and that it can be used effectively in practice.

Paper Structure (19 sections, 1 equation, 3 figures, 2 tables, 1 algorithm)

Figures (3)

  • Figure 1: Log-probability filtering determines whether a question and its generated SQL are answerable or unanswerable, based on the log probabilities of the tokens generated by the Text2SQL model. If the log probability of a token falls below a certain threshold, the question and SQL are classified as unanswerable.
  • Figure 2: Overall architecture of our method. During training, we fine-tune the gpt-3.5-turbo model on a dataset from which unanswerable cases have been removed. We then identify unanswerable cases by filtering on log probability and by executing the SQL, and finally derive the answers.
  • Figure 3: Left: log probability distribution of the fine-tuned model. Right: log probability distribution of the model without fine-tuning.
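
The two filters described in the captions for Figures 1 and 2 can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the threshold value of -1.0, the toy `patients` table, and the use of an in-memory SQLite database are all assumptions made for the example. The token log probabilities are passed in as a plain list, standing in for whatever the Text2SQL model returns.

```python
import sqlite3
from typing import Optional, Sequence


def passes_logprob_gate(token_logprobs: Sequence[float], threshold: float) -> bool:
    """Return True only if no generated token falls below the threshold.

    Hypothetical helper: mirrors the rule in the Figure 1 caption, where a
    single low-probability token marks the prediction as unanswerable.
    """
    return len(token_logprobs) > 0 and min(token_logprobs) >= threshold


def passes_execution_gate(sql: str, conn: sqlite3.Connection) -> bool:
    """Return True if the query runs without a database error.

    Hypothetical helper: assumes read-only queries against a SQLite database.
    """
    try:
        conn.execute(sql).fetchall()
        return True
    except sqlite3.Error:
        # Covers syntax errors as well as unknown tables or columns.
        return False


def gate_prediction(
    sql: str,
    token_logprobs: Sequence[float],
    conn: sqlite3.Connection,
    threshold: float,
) -> Optional[str]:
    """Apply both filters in order; None means 'treat as unanswerable'."""
    if not passes_logprob_gate(token_logprobs, threshold):
        return None
    if not passes_execution_gate(sql, conn):
        return None
    return sql


# Toy database standing in for the real EHR schema (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patients (subject_id INTEGER, gender TEXT)")
conn.execute("INSERT INTO patients VALUES (1, 'f')")

# All tokens are confident and the query executes: kept.
confident = gate_prediction(
    "SELECT COUNT(*) FROM patients", [-0.01, -0.2, -0.05], conn, threshold=-1.0
)
# One token has a very low log probability: filtered by the first gate.
uncertain = gate_prediction(
    "SELECT COUNT(*) FROM patients", [-0.01, -3.2, -0.05], conn, threshold=-1.0
)
# Confident tokens, but the table does not exist: filtered by execution.
broken = gate_prediction(
    "SELECT age FROM admissions", [-0.01, -0.02], conn, threshold=-1.0
)
```

Running the cheap log-probability check before the execution check means that low-confidence predictions never reach the database. The ordering here follows the Figure 2 caption, and the threshold would in practice be chosen from the observed log probability distribution rather than fixed by hand.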