STaR-SQL: Self-Taught Reasoner for Text-to-SQL

Mingqian He, Yongliang Shen, Wenqi Zhang, Qiuying Peng, Jun Wang, Weiming Lu

TL;DR

STaR-SQL tackles text-to-SQL on complex, cross-domain databases by reframing SQL generation as a reasoning task. It runs a self-taught reasoning loop that bootstraps high-quality rationales through iterative fine-tuning, and at test time it uses an outcome-supervised reward model (ORM) with best-of-$N$ sampling to verify candidate queries and select one. On the Spider benchmark, STaR-SQL reaches an execution accuracy of $86.6\%$, surpassing a few-shot baseline and a directly fine-tuned baseline, and outperforming agent-like prompting methods built on closed models such as GPT-4. The results suggest that reasoning-augmented training combined with scalable test-time computation is practical for structured tasks like text-to-SQL and may extend to other structured reasoning problems.

Abstract

Generating step-by-step "chain-of-thought" rationales has proven effective for improving the performance of large language models on complex reasoning tasks. However, applying such techniques to structured tasks, such as text-to-SQL, remains largely unexplored. In this paper, we introduce Self-Taught Reasoner for text-to-SQL (STaR-SQL), a novel approach that reframes SQL query generation as a reasoning-driven process. Our method prompts the LLM to produce detailed reasoning steps for SQL queries and fine-tunes it on rationales that lead to correct outcomes. Unlike traditional methods, STaR-SQL dedicates additional test-time computation to reasoning, thereby positioning LLMs as spontaneous reasoners rather than mere prompt-based agents. To further scale the inference process, we incorporate an outcome-supervised reward model (ORM) as a verifier, which enhances SQL query accuracy. Experimental results on the challenging Spider benchmark demonstrate that STaR-SQL significantly improves text-to-SQL performance, achieving an execution accuracy of 86.6%. This surpasses a few-shot baseline by 31.6% and a baseline fine-tuned to predict answers directly by 18.0%. Additionally, STaR-SQL outperforms agent-like prompting methods that leverage more powerful yet closed-source models such as GPT-4. These findings underscore the potential of reasoning-augmented training for structured tasks and open the door to extending self-improving reasoning models to text-to-SQL generation and beyond.
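The test-time verification described in the abstract can be illustrated with a minimal best-of-N selection sketch. This is not the authors' implementation: the function name `best_of_n` and the `toy_scores` table are hypothetical, and the hard-coded scores merely stand in for the output of a trained ORM.

```python
from typing import Callable, Sequence


def best_of_n(candidates: Sequence[str], score: Callable[[str], float]) -> str:
    """Return the candidate with the highest verifier score (first one wins ties)."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=score)


# Hypothetical verifier scores; a real system would query a trained ORM instead.
toy_scores = {
    "SELECT name FROM singer": 0.21,
    "SELECT count(*) FROM singer": 0.93,
    "SELECT count(name) FROM concert": 0.40,
}

# Treat the dictionary keys as N = 3 sampled candidate queries.
chosen = best_of_n(list(toy_scores), toy_scores.__getitem__)
print(chosen)
```

Sampling more candidates gives the verifier more chances to find a correct query, which is the sense in which this step scales test-time computation.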

Paper Structure

This paper contains 20 sections, 2 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: A comparison of different text-to-SQL methods. Traditional PLM-based methods focus on how to encode the schema (e.g., RATSQL [wang2019rat]). Current LLM-based methods employ carefully designed prompts and subtask flows to simplify and understand the task, functioning in an agent-like manner and using many tokens in the prompt (e.g., DIN-SQL [pourreza2024din]). We treat text-to-SQL as a reasoning-driven process. By leveraging the LLM’s existing reasoning capabilities, we iteratively bootstrap its ability to generate high-quality rationales. In addition, by allocating more test-time computation, we further improve the reliability of the process.
  • Figure 2: An overview of the STaR-SQL framework. It consists of three main steps: step-by-step rationale generation for self-improvement, verifier training, and test-time verification. We transform text-to-SQL into a reasoning task and further explore scaling up test-time computation by incorporating a verifier and employing best-of-N sampling.
  • Figure 3: Execution accuracy comparison across different query difficulty levels on the Spider development set.
  • Figure 4: Performance of STaR-SQL with varying numbers of solutions (N).
  • Figure 5: A case study from the Spider dev set.
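
The first step in Figure 2, keeping only rationales that lead to correct outcomes, can be sketched as an execution-based filter. This is a simplified illustration under stated assumptions, not the paper's code: the helper names, the toy `singer` table, and the sample rationales are all invented here, and comparing result rows as order-insensitive multisets is a simplification of how execution accuracy is scored on Spider.

```python
import sqlite3
from collections import Counter


def run_query(conn, sql):
    """Return the result rows as a multiset, or None if the query fails."""
    try:
        return Counter(conn.execute(sql).fetchall())
    except sqlite3.Error:
        return None


def leads_to_correct_outcome(conn, predicted_sql, gold_sql):
    """True when the predicted query executes and matches the gold result rows."""
    predicted = run_query(conn, predicted_sql)
    return predicted is not None and predicted == run_query(conn, gold_sql)


def keep_correct_rationales(conn, samples, gold_sql):
    """Keep (rationale, sql) pairs whose SQL reproduces the gold result."""
    return [
        (rationale, sql)
        for rationale, sql in samples
        if leads_to_correct_outcome(conn, sql, gold_sql)
    ]


# Toy database and samples, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO singer VALUES (?, ?)", [("Ann", 31), ("Bo", 24), ("Cy", 45)]
)

gold_sql = "SELECT name FROM singer WHERE age > 30"
samples = [
    ("Filter singers older than 30, then return names.",
     "SELECT name FROM singer WHERE age > 30 ORDER BY name DESC"),
    ("Return every singer name.", "SELECT name FROM singer"),
    ("Look in the artist table.", "SELECT name FROM artist WHERE age > 30"),
]

kept = keep_correct_rationales(conn, samples, gold_sql)
print(len(kept))
```

In a self-taught loop, the pairs in `kept` would become fine-tuning data for the next iteration, while the wrong and non-executable samples are discarded.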