Confidence Estimation for Error Detection in Text-to-SQL Systems

Oleg Somov, Elena Tutubalina

TL;DR

This work tackles robust Text-to-SQL by integrating selective classifiers that abstain based on uncertainty, so that erroneous or out-of-distribution queries can be detected. It evaluates multiple uncertainty-estimation strategies, including a maximum-entropy sequence score and a normalized sequence probability, across encoder-decoder (T5) and decoder-only (Llama 3) models, on the SPIDER-based PAUQ dataset and on EHRSQL, under domain, compositional, and covariate shifts. Gaussian Mixture clustering offers the best balance between coverage and error detection, though it often incurs higher false discovery rates. Calibration plays a crucial role: Isotonic regression consistently improves calibration across models, particularly for encoder-decoder architectures. The analysis reveals that unanswerable queries are more readily detected than incorrect ones, and that query complexity does not reliably explain uncertainty. These findings underscore the need for calibrated selective prediction in practical Text-to-SQL deployments and point to directions for more robust, trustworthy NL-to-SQL systems.
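
The two uncertainty scores named above can be illustrated with a minimal sketch. The exact definitions used in the paper are not given in this summary, so the code assumes that the maximum-entropy score is the largest per-step Shannon entropy of the next-token distribution, and that the normalized sequence probability is the geometric mean of the chosen tokens' probabilities. All function names and the toy distributions are hypothetical.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def max_entropy_score(step_distributions):
    """Assumed definition: the largest per-step entropy in the generated sequence."""
    return max(token_entropy(dist) for dist in step_distributions)

def normalized_sequence_probability(chosen_token_probs):
    """Assumed definition: geometric mean of the probabilities of the emitted tokens."""
    mean_log_prob = sum(math.log(p) for p in chosen_token_probs) / len(chosen_token_probs)
    return math.exp(mean_log_prob)

# Toy next-token distributions over a 4-token vocabulary (made-up numbers).
confident_generation = [[0.97, 0.01, 0.01, 0.01], [0.94, 0.02, 0.02, 0.02]]
uncertain_generation = [[0.97, 0.01, 0.01, 0.01], [0.25, 0.25, 0.25, 0.25]]

confident_score = max_entropy_score(confident_generation)
uncertain_score = max_entropy_score(uncertain_generation)
```

Under these assumptions, a single highly uncertain decoding step is enough to raise the maximum-entropy score of the whole query, whereas the normalized probability averages over all steps.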

Abstract

Text-to-SQL enables users to interact with databases through natural language, simplifying the retrieval and synthesis of information. Despite the success of large language models (LLMs) in converting natural language questions into SQL queries, their broader adoption is limited by two main challenges: achieving robust generalization across diverse queries and ensuring interpretative confidence in their predictions. To tackle these issues, our research investigates the integration of selective classifiers into Text-to-SQL systems. We analyse the trade-off between coverage and risk using entropy-based confidence estimation with selective classifiers and assess its impact on the overall performance of Text-to-SQL models. Additionally, we explore the models' initial calibration and improve it with calibration techniques for better alignment between confidence and accuracy. Our experimental results show that encoder-decoder T5 is better calibrated than in-context-learning GPT 4 and decoder-only Llama 3, so the designated external entropy-based selective classifier performs better. The study also reveals that, in terms of error detection, the selective classifier is more likely to detect errors associated with irrelevant questions than errors arising from incorrect query generations.
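
The coverage-risk trade-off described above can be sketched with a simple threshold-based selective classifier. This is an illustrative sketch, not the authors' implementation: it assumes that a query is accepted when its uncertainty score is at or below a threshold, that coverage is the fraction of accepted queries, and that risk is the error rate among accepted queries. The function name and the toy data are hypothetical.

```python
def selective_metrics(uncertainties, is_correct, threshold):
    """Return (coverage, risk) for a threshold selective classifier.

    A prediction is accepted when its uncertainty is <= threshold;
    otherwise the system abstains.
    """
    accepted = [ok for u, ok in zip(uncertainties, is_correct) if u <= threshold]
    coverage = len(accepted) / len(is_correct)
    # Risk is the share of wrong answers among the accepted ones (0 if none accepted).
    risk = sum(1 for ok in accepted if not ok) / len(accepted) if accepted else 0.0
    return coverage, risk

# Made-up uncertainty scores and execution-match outcomes for five queries.
uncertainties = [0.1, 0.2, 0.4, 0.9, 1.3]
is_correct = [True, True, False, True, False]

strict = selective_metrics(uncertainties, is_correct, threshold=0.3)
loose = selective_metrics(uncertainties, is_correct, threshold=1.0)
no_abstention = selective_metrics(uncertainties, is_correct, threshold=2.0)
```

On this toy data, lowering the threshold reduces risk at the cost of coverage, which is the trade-off a selective classifier has to balance.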

Paper Structure (39 sections, 7 equations, 10 figures, 4 tables)


Figures (10)

  • Figure 1: The interaction scenario with a Text-to-SQL system. There are three major scenarios where confidence is crucial: detecting good generations, detecting erroneous generations, and detecting unanswerable queries.
  • Figure 2: Heatmaps of $F_{\beta=1}$ per split and model for every selective classifier (Logistic Regression, Gaussian Mixture, and Threshold).
  • Figure 3: Left: The decrease in system risk with a Gaussian Mixture for every split, averaged over all SQL generation models. Right: The decrease in system coverage in the presence of a Gaussian Mixture external classifier for every split, averaged over all SQL generation models.
  • Figure 4: The calibration effect on T5-3B on PAUQ XSP (cross-database setting) and EHRSQL (single clinical database), compared across MinMax, Platt, and Isotonic calibration (BS stands for Brier score).
  • Figure 5: Trade-off plots between execution match and calibration for selected Text-to-SQL models (T5-large, T5-3B, Llama 3 in SFT and LoRA settings).
  • ...and 5 more figures
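
The Figure 4 caption refers to Isotonic calibration and the Brier score. As a rough illustration of what these involve, the sketch below computes a Brier score and fits an isotonic (non-decreasing) mapping with the pool-adjacent-violators algorithm. It is a pure-Python sketch under stated assumptions, not the paper's pipeline: the function names and toy data are hypothetical, tied scores are not handled specially, and the mapping is fitted and evaluated on the same points, whereas in practice it would be fitted on held-out data.

```python
def brier_score(probabilities, outcomes):
    """Mean squared difference between predicted confidence and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probabilities, outcomes)) / len(outcomes)

def isotonic_fit(scores, labels):
    """Pool-adjacent-violators fit; returns calibrated values in the input order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    blocks = []  # each block holds [sum of labels, number of points]
    for i in order:
        blocks.append([float(labels[i]), 1])
        # Merge backwards while the previous block's mean exceeds the current one.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted_sorted = []
    for total, count in blocks:
        fitted_sorted.extend([total / count] * count)
    fitted = [0.0] * len(scores)
    for rank, i in enumerate(order):
        fitted[i] = fitted_sorted[rank]
    return fitted

# Made-up model confidences and execution-match outcomes (1 = correct SQL).
confidences = [0.2, 0.4, 0.5, 0.7, 0.9, 0.95]
correct = [0, 1, 0, 0, 1, 1]

calibrated = isotonic_fit(confidences, correct)
raw_brier = brier_score(confidences, correct)
calibrated_brier = brier_score(calibrated, correct)
```

On this toy data the isotonic mapping lowers the Brier score, but an in-sample improvement like this says nothing about how well the calibration would hold on unseen queries.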