Text-to-SQL Calibration: No Need to Ask -- Just Rescale Model Probabilities

Ashwin Ramachandran, Sunita Sarawagi

TL;DR

This work investigates calibration techniques for assigning confidence to generated SQL queries and shows that a straightforward baseline -- deriving confidence from the model's full-sequence probability -- outperforms recent methods that rely on follow-up prompts for self-checking and confidence verbalization.

Abstract

Calibration is crucial as large language models (LLMs) are increasingly deployed to convert natural language queries into SQL for commercial databases. In this work, we investigate calibration techniques for assigning confidence to generated SQL queries. We show that a straightforward baseline -- deriving confidence from the model's full-sequence probability -- outperforms recent methods that rely on follow-up prompts for self-checking and confidence verbalization. Our comprehensive evaluation, conducted across two widely-used Text-to-SQL benchmarks and multiple LLM architectures, provides valuable insights into the effectiveness of various calibration strategies.
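
The baseline described above takes its confidence from the model's full-sequence probability. As a minimal sketch, assuming the generating model exposes a log-probability for each token of the SQL query, that probability is the product of the token probabilities, computed here in log space. The function name `sequence_confidence` and the example values are hypothetical, and this excerpt does not say which rescaling step the authors apply afterwards.

```python
import math


def sequence_confidence(token_logprobs):
    """Return the full-sequence probability as exp(sum of token log-probs)."""
    if not token_logprobs:
        raise ValueError("token_logprobs must be non-empty")
    # Summing log-probabilities avoids underflow from multiplying many small numbers.
    return math.exp(sum(token_logprobs))


# Hypothetical log-probabilities for a four-token SQL query (illustrative values only).
example_logprobs = [math.log(0.9), math.log(0.8), math.log(0.95), math.log(0.99)]
confidence = sequence_confidence(example_logprobs)
```

Because the product shrinks as queries get longer, the raw value is usually not a calibrated probability of correctness on its own, which is why a rescaling step on held-out data is applied on top of it.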

Paper Structure

This paper contains 23 sections, 5 figures, 18 tables.

Figures (5)

  • Figure 1: The reliability plots compare calibration across different methods and models. The top four plots use predictions from the Spider dataset and the bottom four from the BIRD dataset. Plots have been generated with uniform binning and isotonic scaling. A well-calibrated plot aligns closely with the x=y line. Each point is color-coded based on the number of samples in the bin, as indicated by the colorbar on the right.
  • Figure 2: The plots have been generated using monotonic binning in place of the uniform binning used in Figure 1. The top four plots use predictions from the Spider dataset and the bottom four from the BIRD dataset. A well-calibrated plot aligns closely with the x=y line. Each point is color-coded based on the number of samples in the bin, as indicated by the colorbar on the right.
  • Figure 3: The plots have been generated using Platt scaling in place of the isotonic scaling used in Figure 1. The top four plots use predictions from the Spider dataset and the bottom four from the BIRD dataset. A well-calibrated plot aligns closely with the x=y line. Each point is color-coded based on the number of samples in the bin, as indicated by the colorbar on the right.
  • Figure 4: The reliability plots continued from Figure 1 to illustrate the calibration comparison between the different whole-query methods. The top four plots use predictions from the Spider dataset and the bottom four from the BIRD dataset. A well-calibrated plot aligns closely with the x=y line. Each point is color-coded based on the number of samples in the bin, as indicated by the colorbar on the right.
  • Figure 5: The reliability plots continued from Figure 1 to illustrate the calibration comparison between the different whole-query methods. The top four plots use predictions from the Spider dataset and the bottom four from the BIRD dataset. A well-calibrated plot aligns closely with the x=y line. Each point is color-coded based on the number of samples in the bin, as indicated by the colorbar on the right.
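
The reliability plots described in these captions can be sketched in a few lines. The example below is an illustration under stated assumptions, not the authors' plotting code: it groups predictions into equal-width (uniform) confidence bins and reports, for each non-empty bin, the mean confidence, the observed accuracy, and the sample count used for color-coding. The function name `reliability_bins` and the toy inputs are hypothetical.

```python
def reliability_bins(confidences, correct, n_bins=10):
    """Return (mean_confidence, accuracy, count) for each non-empty uniform bin.

    confidences: floats in [0, 1], one per generated query.
    correct: 0/1 flags, 1 if the generated query was judged correct.
    """
    # Each accumulator holds [sum of confidences, sum of correct flags, count].
    acc = [[0.0, 0.0, 0] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        acc[idx][0] += c
        acc[idx][1] += y
        acc[idx][2] += 1
    return [(s / n, hits / n, n) for s, hits, n in acc if n > 0]


# Toy data (illustrative only): four predictions and their correctness flags.
points = reliability_bins([0.05, 0.15, 0.95, 0.95], [0, 0, 1, 0], n_bins=10)
```

Plotting mean confidence on the x-axis against accuracy on the y-axis gives one point per bin; the closer the points lie to the x=y line, the better calibrated the confidence scores are.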