ExCoT: Optimizing Reasoning for Text-to-SQL with Execution Feedback

Bohan Zhai, Canwen Xu, Yuxiong He, Zhewei Yao

TL;DR

The paper addresses the challenge of leveraging chain-of-thought reasoning for text-to-SQL. It introduces ExCoT, which couples CoT with off-policy and on-policy Direct Preference Optimization, guided only by execution accuracy. Empirical results on the BIRD and Spider benchmarks show state-of-the-art single-model performance and robust gains from the staged training regime. The approach reduces reliance on reward models or human annotations and demonstrates potential applicability to other structured generation tasks such as code synthesis. Overall, ExCoT presents a scalable, self-guided pathway to improve reasoning-driven SQL generation with open-source models.

Abstract

Text-to-SQL demands precise reasoning to convert natural language questions into structured queries. While large language models (LLMs) excel in many reasoning tasks, their ability to leverage Chain-of-Thought (CoT) reasoning for text-to-SQL remains underexplored. We identify critical limitations: zero-shot CoT offers minimal gains, and Direct Preference Optimization (DPO) applied without CoT yields marginal improvements. We propose ExCoT, a novel framework that iteratively optimizes open-source LLMs by combining CoT reasoning with off-policy and on-policy DPO, relying solely on execution accuracy as feedback. This approach eliminates the need for reward models or human-annotated preferences. Our experimental results demonstrate significant performance gains: for LLaMA-3 70B, ExCoT improves execution accuracy on the BIRD dev set from 57.37% to 68.51% and on the Spider test set from 78.81% to 86.59%, with Qwen-2.5-Coder demonstrating similar improvements. Our best model achieves state-of-the-art performance in the single-model setting on both the BIRD and Spider datasets, notably achieving 68.53% on the BIRD test set.

Paper Structure

This paper contains 42 sections, 3 figures, 6 tables.

Figures (3)

  • Figure 1: The workflow of ExCoT. (1) We use a well-designed prompt to obtain candidate data from an LLM (GPT-4o is used in our experiments). We execute the extracted SQL queries on a local SQLite instance and compare the results with the ground truth. We use the positive examples for supervised fine-tuning (SFT) of the base model and to construct the pairs for off-policy DPO. (2) We use the model trained with off-policy DPO to generate new candidate CoT data for on-policy DPO. We repeat this process iteratively for multiple rounds.
  • Figure 2: Number of valid data pairs and corresponding execution accuracy on BIRD across successive training stages. Although the pool of valid preference pairs decreases after off-policy DPO, each additional on-policy round of iterative DPO continues to boost execution accuracy, demonstrating that smaller yet targeted sets of self-generated examples effectively refine the model’s reasoning and SQL generation capabilities.
  • Figure 3: Number of CoT tokens across different training stages.
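
The execution-feedback step described in the Figure 1 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names `execute_sql` and `build_preference_pairs`, the candidate dictionary layout, the order-insensitive result comparison, and the choice to pair every correct candidate with every incorrect one are all hypothetical. The source only states that extracted SQL is run on a local SQLite instance, compared with the ground truth, and that positive examples feed SFT and the DPO pairs.

```python
import sqlite3
from itertools import product


def execute_sql(conn, sql):
    """Run a query; return its rows in a canonical order, or None if it fails."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    # Assumption: results are compared ignoring row order.
    # Sorting by repr avoids type errors when rows mix NULLs and values.
    return sorted(rows, key=repr)


def build_preference_pairs(conn, gold_sql, candidates):
    """Split candidates by execution correctness and form (chosen, rejected) pairs.

    `candidates` is a list of dicts with hypothetical keys "cot" and "sql".
    Returns (correct_candidates, preference_pairs).
    """
    gold = execute_sql(conn, gold_sql)
    correct, incorrect = [], []
    for cand in candidates:
        result = execute_sql(conn, cand["sql"])
        if result is not None and result == gold:
            correct.append(cand)
        else:
            incorrect.append(cand)
    # Assumption: every correct candidate is paired with every incorrect one.
    pairs = [{"chosen": c, "rejected": r} for c, r in product(correct, incorrect)]
    return correct, pairs


# Toy database and candidates, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE singer (name TEXT, age INTEGER);
    INSERT INTO singer VALUES ('Ann', 30), ('Bob', 45), ('Cy', 52);
    """
)
candidates = [
    {"cot": "Filter on age, then select names.",
     "sql": "SELECT name FROM singer WHERE age > 40"},
    {"cot": "Wrong threshold.",
     "sql": "SELECT name FROM singer WHERE age >= 30"},
    {"cot": "Misspelled column.",
     "sql": "SELECT nme FROM singer WHERE age > 40"},
]
gold_sql = "SELECT name FROM singer WHERE age > 40 ORDER BY name DESC"
sft_examples, dpo_pairs = build_preference_pairs(conn, gold_sql, candidates)
print(len(sft_examples), len(dpo_pairs))
```

In this sketch, the correct candidates would serve as SFT data and the pairs as DPO training data. For an on-policy round, the same function could be applied to candidates sampled from the current model rather than from an external LLM.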