Text-to-SQL Domain Adaptation via Human-LLM Collaborative Data Annotation
Yuan Tian, Daniel Lee, Fei Wu, Tung Mai, Kun Qian, Siddhartha Sahai, Tianyi Zhang, Yunyao Li
TL;DR
Domain shift and data scarcity hinder practical deployment of text-to-SQL systems on new schemas. The authors introduce SQLsynth, an interactive, human-in-the-loop annotation system that combines a PCFG-based SQL sampler, LLM-assisted NL generation, step-by-step explanations, alignment-based error repair, and dataset-diversity analysis, all wrapped in an extensible UI with schema visualization. A within-subjects study with 12 participants shows that SQLsynth significantly increases annotation throughput while reducing errors, improving naturalness, and enhancing diversity compared with manual annotation or a ChatGPT-only workflow. The work demonstrates that structured human-LLM collaboration can efficiently produce high-quality, schema-specific NL-to-SQL datasets, which in turn support targeted model fine-tuning and domain-specific evaluation in real-world deployments.
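To make the "PCFG-based SQL sampler" idea concrete, here is a minimal sketch of sampling SQL strings from a probabilistic context-free grammar. It is an illustration under stated assumptions, not the SQLsynth implementation: the toy grammar, its probabilities, the `users` table with `name` and `age` columns, and the function names `sample` and `sample_sql` are all hypothetical. A real sampler would presumably derive its rules from the target database schema.

```python
import random

# Hypothetical toy grammar (an assumption for illustration only).
# Each nonterminal maps to a list of (probability, expansion) pairs;
# any symbol that is not a key of GRAMMAR is treated as a terminal token.
GRAMMAR = {
    "QUERY": [
        (0.6, ["SELECT", "COLS", "FROM", "TABLE"]),
        (0.4, ["SELECT", "COLS", "FROM", "TABLE", "WHERE", "COND"]),
    ],
    "COLS": [(0.7, ["COL"]), (0.3, ["COL", ",", "COLS"])],
    "COL": [(0.5, ["name"]), (0.5, ["age"])],
    "TABLE": [(1.0, ["users"])],
    "COND": [(0.5, ["age", ">", "30"]), (0.5, ["name", "=", "'Ann'"])],
}


def sample(symbol, rng, depth=0, max_depth=20):
    """Recursively expand `symbol` into a list of terminal tokens."""
    if symbol not in GRAMMAR:
        return [symbol]  # terminal: emit as-is
    rules = GRAMMAR[symbol]
    if depth >= max_depth:
        # Safety guard: fall back to the shortest expansion so that
        # recursive rules such as COLS cannot expand indefinitely.
        expansion = min((exp for _, exp in rules), key=len)
    else:
        weights = [prob for prob, _ in rules]
        expansions = [exp for _, exp in rules]
        expansion = rng.choices(expansions, weights=weights, k=1)[0]
    tokens = []
    for child in expansion:
        tokens.extend(sample(child, rng, depth + 1, max_depth))
    return tokens


def sample_sql(seed=None):
    """Sample one SQL string; a fixed seed makes the draw reproducible."""
    rng = random.Random(seed)
    return " ".join(sample("QUERY", rng))


if __name__ == "__main__":
    for seed in range(3):
        print(sample_sql(seed))
```

Adjusting the rule probabilities shifts the distribution of sampled query shapes, for example toward more `WHERE` clauses or longer column lists.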
Abstract
Text-to-SQL models, which parse natural language (NL) questions into executable SQL queries, are increasingly adopted in real-world applications. However, deploying such models in practice often requires adapting them to the highly specialized database schemas used in specific applications. We find that existing text-to-SQL models experience significant performance drops when applied to new schemas, primarily due to the lack of domain-specific data for fine-tuning. This data scarcity also limits the ability to effectively evaluate model performance in new domains. Continuously obtaining high-quality text-to-SQL data for evolving schemas is prohibitively expensive in real-world scenarios. To bridge this gap, we propose SQLsynth, a human-in-the-loop text-to-SQL data annotation system. SQLsynth streamlines the creation of high-quality text-to-SQL datasets through human-LLM collaboration in a structured workflow. A within-subjects user study comparing SQLsynth with manual annotation and ChatGPT shows that SQLsynth significantly accelerates text-to-SQL data annotation, reduces cognitive load, and produces datasets that are more accurate, natural, and diverse. Our code is available at https://github.com/magic-YuanTian/SQLsynth.
