Do LLMs Align with My Task? Evaluating Text-to-SQL via Dataset Alignment
Davood Rafiei, Morgan Lindsay Heisler, Weiwei Zhang, Mohammadreza Pourreza, Yong Zhang
TL;DR
The paper investigates how alignment between supervised fine-tuning (SFT) data and target NL2SQL queries governs fine-tuning success. By defining KL-alignment over structural SQL templates and introducing an Alignment Ratio, it demonstrates that high alignment predicts substantial gains in execution accuracy and exact-match accuracy across cross-domain benchmarks and model families. The study provides a predictive framework for selecting in-domain training data and warns that overfitting to a single domain can harm cross-domain generalization. It also shows that small samples can guide alignment-based data selection, enabling practical domain adaptation in real-world settings. Overall, alignment-aware data selection emerges as a critical lever for improving NL2SQL transfer learning and generalization.
Abstract
Supervised Fine-Tuning (SFT) is an effective method for adapting Large Language Models (LLMs) to downstream tasks. However, variability in training data can hinder a model's ability to generalize across domains. This paper studies the problem of dataset alignment for Natural Language to SQL (NL2SQL, or text-to-SQL), examining how well SFT training data matches the structural characteristics of target queries and how this alignment impacts model performance. We hypothesize that alignment can be accurately estimated by comparing the distributions of structural SQL features across the training set, the target data, and the model's predictions prior to SFT. Through comprehensive experiments on three large cross-domain NL2SQL benchmarks and multiple model families, we show that structural alignment is a strong predictor of fine-tuning success. When alignment is high, SFT yields substantial gains in accuracy and SQL generation quality; when alignment is low, improvements are marginal or absent. These findings highlight the importance of alignment-aware data selection for effective fine-tuning and generalization in NL2SQL tasks.
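To make the idea of comparing distributions of structural SQL features concrete, the following is a minimal Python sketch, not the paper's implementation. It assumes a deliberately crude notion of "template" (the ordered sequence of clause keywords in a query), additive smoothing with a hypothetical `eps` parameter, and the direction KL(target || train). The helper names `template`, `distribution`, and `kl_between` are illustrative, and the paper's exact feature set and its Alignment Ratio are not reproduced here.

```python
import math
import re
from collections import Counter

# Assumption: a "structural template" is just the ordered list of clause
# keywords in the query. Real feature extraction would use a SQL parser;
# this regex also matches keywords that appear inside string literals.
KEYWORDS = ["SELECT", "FROM", "JOIN", "WHERE", "GROUP BY", "HAVING",
            "ORDER BY", "LIMIT", "UNION", "INTERSECT", "EXCEPT"]
_PATTERN = re.compile(
    "|".join(r"\b" + k.replace(" ", r"\s+") + r"\b" for k in KEYWORDS),
    re.IGNORECASE,
)


def template(sql):
    """Reduce a SQL string to its sequence of clause keywords."""
    found = (re.sub(r"\s+", " ", m.group(0).upper())
             for m in _PATTERN.finditer(sql))
    return " ".join(found)


def distribution(queries, support, eps=1e-6):
    """Smoothed template distribution over a shared support set."""
    counts = Counter(template(q) for q in queries)
    total = sum(counts.values()) + eps * len(support)
    return {t: (counts[t] + eps) / total for t in support}


def kl_between(target_queries, train_queries, eps=1e-6):
    """KL(target || train) over templates; lower means closer alignment."""
    support = ({template(q) for q in target_queries}
               | {template(q) for q in train_queries})
    p = distribution(target_queries, support, eps)
    q = distribution(train_queries, support, eps)
    return sum(p[t] * math.log(p[t] / q[t]) for t in support)


if __name__ == "__main__":
    target = ["SELECT a FROM t WHERE x = 1",
              "SELECT b FROM u WHERE y > 2"]
    aligned = ["SELECT c FROM v WHERE z < 3"]
    misaligned = ["SELECT c FROM v ORDER BY c LIMIT 5"]
    print("aligned:   ", kl_between(target, aligned))
    print("misaligned:", kl_between(target, misaligned))
```

Under these assumptions, a candidate training set whose template distribution is closer to the target's yields a smaller divergence, which is the kind of signal the abstract proposes using to anticipate whether SFT will help. The same comparison could in principle be run against the model's pre-SFT predictions.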
