Thinkquel: A Model Dedicated to Text-to-dbt Using Synthetic Data and a Span-Aware Objective
Anni Li, Aria Attar, Paul Dong
TL;DR
Thinkquel addresses the difficulty of producing production-ready dbt transformations from natural language by combining a scalable synthetic NL+dbt data pipeline with a span-aware reinforcement learning objective. The approach pairs a two-stage supervised fine-tuning curriculum with Token-Sequence GRPO (TS-GRPO), which separates planning (token-level) from the production dbt/SQL output (sequence-level), to align optimization with execution-based supervision. Empirical results show strong in-domain performance on TS-SQL and Spider, with robust stability and competitive out-of-domain results on BIRD-dbt, driven by planning-aware training and span-based credit routing. These contributions advance portable, execution-validated text-to-dbt generation and suggest future work on broader warehouse portability and tool-assisted verification.
Abstract
Transforming natural-language requests into reliable, production-ready data transformations remains challenging: correctness depends on precise schema linking and warehouse-specific SQL dialects, while the strongest supervision available during training, execution success and result matching, is provided only at the sequence level. At the same time, assembling large, execution-validated corpora is costly, and token-level objectives misalign with these global signals, yielding unstable optimization and limited portability. We introduce Thinkquel, a fine-tuned model for producing robust, portable, and execution-validated database queries. Thinkquel integrates a novel synthetic data pipeline, TS-SQL, which leverages dbt as a portable intermediate representation, with a span-aware reinforcement learning objective, Token-Sequence GRPO (TS-GRPO), designed to bridge the gap between token-level training signals and sequence-level execution rewards when fine-tuning LLMs. On the 500-example TS-SQL test set, Thinkquel (32B) reaches 93.2% execution success and 61.8% exact-result match with a two-stage SFT curriculum, improving over the base model by 67.2% (exec.) and 44.4% (match). In Spider (14B) experiments, TS-GRPO increases training stability and speeds convergence of the execution-match reward relative to GRPO and GSPO.
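To make the two sequence-level signals named above concrete, the following is a minimal sketch of how execution success and exact-result match might be scored for a batch of predictions. It is an illustration under stated assumptions, not the evaluation harness used for Thinkquel: the helper `evaluate`, the toy `orders` table, the use of SQLite in place of a real warehouse, plain SQL in place of compiled dbt models, and the order-insensitive row comparison are all hypothetical choices made for this example.

```python
import sqlite3


def evaluate(predicted_sql, gold_sql, conn):
    """Return (executed, matched) for one example.

    executed: the predicted query ran without a database error.
    matched:  its rows equal the gold query's rows.
    Assumption: rows are compared as an order-insensitive multiset.
    """
    gold_rows = conn.execute(gold_sql).fetchall()
    try:
        pred_rows = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        # Any execution failure counts against both metrics.
        return False, False
    matched = sorted(map(repr, pred_rows)) == sorted(map(repr, gold_rows))
    return True, matched


# Hypothetical toy database standing in for a warehouse.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (id INTEGER, amount REAL);
    INSERT INTO orders VALUES (1, 10.0), (2, 25.5), (3, 4.5);
    """
)

gold = "SELECT SUM(amount) FROM orders"
predictions = [
    "SELECT SUM(amount) FROM orders",  # runs and matches the gold result
    "SELECT COUNT(*) FROM orders",     # runs but returns a different result
    "SELECT SUM(amt) FROM orders",     # fails: column does not exist
]

results = [evaluate(p, gold, conn) for p in predictions]
exec_rate = sum(executed for executed, _ in results) / len(results)
match_rate = sum(matched for _, matched in results) / len(results)
print(f"execution success: {exec_rate:.3f}, exact-result match: {match_rate:.3f}")
```

Both scores are computed only after the full query has been generated and run, which is why they are available only at the sequence level and cannot be attributed directly to individual tokens.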
