Query and Conquer: Execution-Guided SQL Generation

Łukasz Borchmann, Marek Wydmuch

TL;DR

The paper tackles the gap between pass@k and pass@1 in text-to-SQL by introducing execution-guided self-consistency, which uses execution results to compare candidate queries within a Minimum Bayes Risk (MBR) framework. It distinguishes exact execution similarity, which compares query outputs, from approximate similarity, which compares execution plans, and extends the approach to partially executable SQL using PipeSQL, with a patience mechanism for refining intermediate steps. Empirically, the method yields significant accuracy gains across a range of models, with small 3B–7B models approaching the performance of heavier reasoning systems at much lower inference cost (up to 30x cheaper). The approach generalizes beyond SQL to broader code-generation tasks, offering a scalable pathway to robust, cost-efficient program synthesis.

Abstract

We propose a novel approach for generating complex outputs that significantly improves accuracy in text-to-SQL tasks. Our method leverages execution results to select the most semantically consistent query from multiple candidates, enabling smaller, cost-effective models to surpass computationally intensive reasoning methods such as o1, o3-mini, and DeepSeek R1 while reducing inference cost by as much as 30 times. It integrates effortlessly with existing models, offering a practical and scalable pathway to state-of-the-art SQL generation.
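The selection step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes SQLite as the execution engine, treats results as order-insensitive multisets of rows, and uses a 0/1 exact-match similarity. The names `execute`, `similarity`, and `select_mbr`, as well as the toy `employees` table and candidate queries, are hypothetical.

```python
import sqlite3
from collections import Counter


def execute(conn, sql):
    """Run a query; return its rows as a multiset, or None if execution fails."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    return Counter(rows)  # assumption: row order is ignored


def similarity(a, b):
    """Exact execution similarity: 1.0 when both results exist and match."""
    if a is None or b is None:
        return 0.0
    return 1.0 if a == b else 0.0


def select_mbr(conn, candidates):
    """Pick the candidate whose result agrees most with the other candidates."""
    results = [execute(conn, sql) for sql in candidates]

    def score(i):
        return sum(
            similarity(results[i], results[j])
            for j in range(len(candidates))
            if j != i
        )

    best = max(range(len(candidates)), key=score)  # ties go to the earliest
    return candidates[best]


# Toy database and hand-written candidates standing in for model samples.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE employees (name TEXT, dept TEXT, salary INTEGER);
    INSERT INTO employees VALUES
        ('Ann', 'eng', 120), ('Bob', 'eng', 100), ('Cy', 'ops', 90);
    """
)
candidates = [
    "SELECT name FROM employees WHERE salary > 95",
    "SELECT name FROM employees WHERE salary >= 100",
    "SELECT name FROM employees WHERE dept = 'ops'",
    "SELECT nme FROM employees",  # not executable: unknown column
]
best = select_mbr(conn, candidates)
print(best)
```

In this toy run the first two candidates differ as text but return the same rows, so they support each other, while the non-executable candidate receives no support. That is the sense in which agreement is measured on execution results rather than on the query strings.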

Paper Structure

This paper contains 39 sections, 6 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Cost-accuracy analysis for Qwen 2.5 Coder 7B, with and without self-consistency (10-20 samples), compared with OpenAI models.
  • Figure 2: Execution-Guided SQL Generation.
  • Figure 3: The PipeSQL dialect has the property that each query prefix (up to the pipe sequence |>) is also a valid query, which makes it possible to apply execution-based self-consistency in the middle of the generation process. Instead of sampling $n$ complete SQL sequences, we sample $n$ pipes and stop the generation process. Then we pick the most consistent pipe and continue generation by sampling $n$ variants of the next pipe.
  • Figure 4: Self-consistency gains for various sample sizes, temperatures, and models (Gemini 2.0 Flash, Llama 3.3 70B, Codestral, Qwen 2.5 Coder 7B).
  • Figure 5: Effect of replacing outputs produced under greedy decoding with self-consistency outputs. Valid and invalid refer to executability, whereas correct and incorrect refer to agreement with the gold standard.
  • ...and 3 more figures
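
The pipe-by-pipe procedure described in the Figure 3 caption can be sketched as follows. This is a toy illustration under loud assumptions: there is no real pipe-syntax engine or language model here. `STEPS` is a hypothetical stand-in that maps each pipe string to a Python function over in-memory rows, `SAMPLES` is a fixed list standing in for the $n$ sampled variants at each step, and the patience mechanism is omitted.

```python
from collections import Counter

# Toy table; plays the role of the implicit FROM clause.
ROWS = [
    {"name": "Ann", "salary": 120},
    {"name": "Bob", "salary": 100},
    {"name": "Cy", "salary": 90},
]

# Hypothetical executor: each pipe string maps to a function over rows.
STEPS = {
    "|> WHERE salary > 95": lambda rows: [r for r in rows if r["salary"] > 95],
    "|> WHERE salary >= 100": lambda rows: [r for r in rows if r["salary"] >= 100],
    "|> WHERE salary > 110": lambda rows: [r for r in rows if r["salary"] > 110],
    "|> SELECT name": lambda rows: [{"name": r["name"]} for r in rows],
    "|> SELECT salary": lambda rows: [{"salary": r["salary"]} for r in rows],
}


def run_prefix(prefix):
    """Execute a query prefix (a list of pipes) and return a multiset of rows."""
    rows = ROWS
    for step in prefix:
        rows = STEPS[step](rows)
    return Counter(tuple(sorted(r.items())) for r in rows)


def most_consistent(prefix, options):
    """Among sampled next pipes, pick the one whose result has the most matches."""
    results = [run_prefix(prefix + [option]) for option in options]
    votes = [sum(r == other for other in results) for r in results]
    return options[votes.index(max(votes))]  # ties go to the earliest


# Stand-in for sampling n = 3 variants of each successive pipe from a model.
SAMPLES = [
    ["|> WHERE salary > 95", "|> WHERE salary >= 100", "|> WHERE salary > 110"],
    ["|> SELECT name", "|> SELECT name", "|> SELECT salary"],
]

prefix = []
for options in SAMPLES:
    prefix.append(most_consistent(prefix, options))

print(" ".join(prefix))
```

Because every prefix is executable, the vote can be taken after each pipe, so a divergent sample is discarded before the rest of the query is generated on top of it.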