An Efficient and Effective Evaluator for Text2SQL Models on Unseen and Unlabeled Data

Trinh Pham, Thanh Tam Nguyen, Viet Huynh, Hongzhi Yin, Quoc Viet Hung Nguyen

TL;DR

Experiments show that FusionSQL closely follows actual accuracy and reliably signals emerging issues, allowing teams to measure quality on unseen and unlabeled datasets.

Abstract

Recent advances in large language models have strengthened Text2SQL systems that translate natural language questions into database queries. A persistent deployment challenge is to assess a newly trained Text2SQL system on an unseen and unlabeled dataset when no verified answers are available. This situation arises frequently because database content and structure evolve, privacy policies slow manual review, and carefully written SQL labels are costly and time-consuming to produce. Without timely evaluation, organizations cannot approve releases or detect failures early. FusionSQL addresses this gap by working with any Text2SQL model and estimating accuracy without reference labels, allowing teams to measure quality on unseen and unlabeled datasets. It analyzes patterns in the system's own outputs to characterize how the target dataset differs from the material used during training. FusionSQL supports pre-release checks, continuous monitoring of new databases, and detection of quality decline. Experiments across diverse application settings and question types show that FusionSQL closely follows actual accuracy and reliably signals emerging issues. Our code is available at https://github.com/phkhanhtrinh23/FusionSQL.

Paper Structure

This paper contains 16 sections, 16 equations, 11 figures, and 7 tables.

Figures (11)

  • Figure 1: Top: Existing Text2SQL evaluations rely on ground-truth labels, which are often unavailable as databases evolve. Bottom: FusionSQL estimates model accuracy directly from unlabeled inputs without requiring ground-truth SQL labels.
  • Figure 2: FusionSQL framework. Training: A frozen Text2SQL model encodes training and FusionDataset samples into embeddings to compute shift descriptors ($\textit{SD}_F$, $\textit{SD}_M$, $\textit{SD}_{SW}$) for training the FusionSQL evaluator. Inference: For unseen, unlabeled workloads, the same descriptors are computed to estimate accuracy without labels or retraining.
  • Figure 3: t-SNE coverage. Comparing 50K samples, FusionDataset bridges clusters of existing benchmarks in both domain (a) and question space (b), reflecting broader semantic and structural diversity of real-world Text2SQL variability.
  • Figure 4: Structural coverage. Radar chart over 16 schema and SQL complexity dimensions shows that FusionDataset (red) consistently achieves higher normalized coverage than existing datasets, reflecting diverse schema and query structures.
  • Figure 5: Sample-set size. Impact of sample-set size $|S_i|$ used to compute distribution shifts for an instance $(\mathcal{D}_{\mathrm{train}},S_i)$. Error decreases as $|S_i|$ grows.
  • ...and 6 more figures
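
The Figure 2 and Figure 5 captions describe computing distribution-shift descriptors between training embeddings and a sample set $S_i$, but they do not define the descriptors themselves. The sketch below is only an illustration of what such a descriptor could look like: it assumes, without confirmation from the source, a sliced-Wasserstein-style distance and a simple mean-difference distance over embedding vectors. The function names `sliced_wasserstein` and `mean_shift`, the synthetic embeddings, and all parameters are hypothetical and are not taken from the FusionSQL code.

```python
import math
import random


def sliced_wasserstein(xs, ys, n_projections=64, seed=0):
    """Approximate sliced 1-Wasserstein distance between two equal-size vector sets.

    Illustrative stand-in for a shift descriptor; not the paper's definition.
    """
    if not xs or len(xs) != len(ys):
        raise ValueError("expects two non-empty sets of equal size")
    dim = len(xs[0])
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_projections):
        # Draw a random direction and normalise it to unit length.
        direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(d * d for d in direction)) or 1.0
        direction = [d / norm for d in direction]
        # Project both sets onto the direction and sort the 1-D values.
        px = sorted(sum(a * d for a, d in zip(x, direction)) for x in xs)
        py = sorted(sum(a * d for a, d in zip(y, direction)) for y in ys)
        # In 1-D, the Wasserstein-1 distance pairs sorted values in order.
        total += sum(abs(a - b) for a, b in zip(px, py)) / len(px)
    return total / n_projections


def mean_shift(xs, ys):
    """Euclidean distance between the mean vectors of two sets."""
    dim = len(xs[0])
    mean_x = [sum(x[i] for x in xs) / len(xs) for i in range(dim)]
    mean_y = [sum(y[i] for y in ys) / len(ys) for i in range(dim)]
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(mean_x, mean_y)))


# Synthetic 2-D "embeddings": a training set and a translated target sample set.
rng = random.Random(42)
train = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(200)]
target = [[x + 3.0, y + 4.0] for x, y in train]

descriptors = {
    "sliced_wasserstein": sliced_wasserstein(train, target),
    "mean_shift": mean_shift(train, target),
}
```

In a pipeline like the one the captions outline, a vector of such descriptors would be the input to a learned evaluator that predicts accuracy. With larger sample sets the descriptor estimates become less noisy, which is consistent with the trend reported for Figure 5.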