Do LLMs Align with My Task? Evaluating Text-to-SQL via Dataset Alignment

Davood Rafiei, Morgan Lindsay Heisler, Weiwei Zhang, Mohammadreza Pourreza, Yong Zhang

TL;DR

The paper investigates how dataset alignment between SFT data and target NL2SQL queries governs fine-tuning success. By defining KL-alignment over structural SQL templates and introducing an Alignment Ratio, it demonstrates that high alignment predicts substantial gains in execution and exact-match SQL accuracy across cross-domain benchmarks and model families. The study provides a predictive framework for selecting in-domain training data and warns against overfitting to a single domain, which can harm cross-domain generalization. It also shows that small samples can guide alignment-based data selection, enabling practical domain adaptation in real-world settings. Overall, alignment-aware data selection emerges as a critical lever for improving NL2SQL transfer learning and generalization.

Abstract

Supervised Fine-Tuning (SFT) is an effective method for adapting Large Language Models (LLMs) to downstream tasks. However, variability in training data can hinder a model's ability to generalize across domains. This paper studies the problem of dataset alignment for Natural Language to SQL (NL2SQL, or text-to-SQL), examining how well SFT training data matches the structural characteristics of target queries and how this alignment impacts model performance. We hypothesize that alignment can be accurately estimated by comparing the distributions of structural SQL features across the training set, target data, and the model's predictions prior to SFT. Through comprehensive experiments on three large cross-domain NL2SQL benchmarks and multiple model families, we show that structural alignment is a strong predictor of fine-tuning success. When alignment is high, SFT yields substantial gains in accuracy and SQL generation quality; when alignment is low, improvements are marginal or absent. These findings highlight the importance of alignment-aware data selection for effective fine-tuning and generalization in NL2SQL tasks.
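
To make the idea of comparing structural feature distributions concrete, here is a minimal Python sketch. It is not the authors' implementation. The keyword list, the `sql_template` reduction, the additive smoothing constant, and the mapping `exp(-KL)` from divergence to an alignment score are all assumptions chosen for illustration; the paper's own template and KL-alignment definitions may differ.

```python
import math
import re
from collections import Counter

# Hypothetical, coarse structural features (an assumption, not the paper's template set).
KEYWORDS = ["SELECT", "JOIN", "WHERE", "GROUP BY", "HAVING", "ORDER BY", "LIMIT"]
_PATTERN = re.compile(r"\b(" + "|".join(re.escape(k) for k in KEYWORDS) + r")\b")


def sql_template(query: str) -> str:
    """Reduce a query to the ordered sequence of structural keywords it contains.

    This is a crude text match: keywords inside string literals are also counted.
    """
    normalized = re.sub(r"\s+", " ", query.upper())
    return " ".join(_PATTERN.findall(normalized))


def template_distribution(queries, support, eps=1e-6):
    """Smoothed probability of each template in `support` (eps is an assumed constant)."""
    counts = Counter(sql_template(q) for q in queries)
    total = sum(counts.values()) + eps * len(support)
    return {t: (counts[t] + eps) / total for t in support}


def kl_divergence(p, q):
    """KL(p || q) over a shared support; both inputs must be strictly positive."""
    return sum(p[t] * math.log(p[t] / q[t]) for t in p)


def kl_alignment(target_queries, source_queries):
    """Assumed alignment score in (0, 1]: exp(-KL(target || source)); 1 means identical."""
    target_queries, source_queries = list(target_queries), list(source_queries)
    support = {sql_template(q) for q in target_queries + source_queries}
    p = template_distribution(target_queries, support)
    q = template_distribution(source_queries, support)
    return math.exp(-kl_divergence(p, q))


target = ["SELECT a FROM t WHERE x = 1", "SELECT a FROM t JOIN u ON t.id = u.id"]
aligned = ["SELECT b FROM s WHERE y = 2", "SELECT c FROM s JOIN v ON s.id = v.id"]
misaligned = ["SELECT a FROM t GROUP BY a", "SELECT a FROM t ORDER BY a LIMIT 5"]

print(kl_alignment(target, aligned))     # structurally matching set scores high
print(kl_alignment(target, misaligned))  # structurally different set scores low
```

In this sketch, a training set whose templates match the target's scores near 1, while a set built from different clause structures scores near 0, regardless of table or column names.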

Paper Structure

This paper contains 32 sections, 2 equations, 4 figures, 8 tables.

Figures (4)

  • Figure 1: Execution accuracy of various models on the Gretel test set before and after supervised fine-tuning (SFT) with different datasets. The graph highlights performance variability: accuracy improves with some datasets and degrades with others, which suggests the potential benefit of predicting post-SFT performance.
  • Figure 2: Correlation between KL-Alignment and a) Execution Accuracy and b) Exact Match Accuracy for base model outputs. Higher KL-Alignment generally corresponds to improved execution accuracy across model families.
  • Figure 3: Predictive nature of alignment ratio (AR): Datasets with AR $> 1$ generally show accuracy improvement after SFT, while those with AR $< 1$ exhibit similar or decreased accuracy. The colour bar at the bottom of the figure highlights better (dark green) and poorer (dark red) alignment ratios.
  • Figure 4: Abstract syntax tree (AST) of the given SQL query
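
The AR threshold described in the Figure 3 caption can be turned into a simple data-selection rule. The Python sketch below assumes AR is the alignment of a candidate training set with the target divided by the alignment of the base model's pre-SFT predictions with the target; that formula is an assumption inferred from the abstract, not a definition confirmed by this page. The function names, dataset names, and scores are hypothetical placeholders, not results from the paper.

```python
from typing import Dict, List, Tuple


def alignment_ratio(train_alignment: float, base_alignment: float) -> float:
    """Assumed AR: alignment(train, target) / alignment(base predictions, target)."""
    if base_alignment <= 0:
        raise ValueError("base_alignment must be positive")
    return train_alignment / base_alignment


def rank_candidates(
    candidates: Dict[str, float], base_alignment: float
) -> List[Tuple[str, float, bool]]:
    """Sort candidate SFT datasets by AR, flagging those with AR > 1.

    Per the Figure 3 caption, AR > 1 generally coincides with accuracy gains
    after SFT; it is a heuristic signal, not a guarantee.
    """
    scored = [(name, alignment_ratio(a, base_alignment)) for name, a in candidates.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return [(name, ar, ar > 1.0) for name, ar in scored]


# Made-up alignment scores, for illustration only.
candidates = {"dataset_a": 0.75, "dataset_b": 0.25, "dataset_c": 0.5}
for name, ar, expect_gain in rank_candidates(candidates, base_alignment=0.5):
    print(f"{name}: AR={ar:.2f} expect_gain={expect_gain}")
```

Because the rule only needs alignment scores, it can be evaluated on a small sample of target queries before committing to a full fine-tuning run.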