CHASE-SQL: Multi-Path Reasoning and Preference Optimized Candidate Selection in Text-to-SQL

Mohammadreza Pourreza, Hailong Li, Ruoxi Sun, Yeounoh Chung, Shayan Talaei, Gaurav Tarlok Kakkar, Yu Gan, Amin Saberi, Fatma Ozcan, Sercan O. Arik

TL;DR

CHASE-SQL introduces a test-time, multi-agent framework for Text-to-SQL that generates diverse SQL candidates using three reasoning strategies (Divide-and-Conquer CoT, Query Plan CoT, and Online Synthetic Example Generation) and selects the best candidate with a fine-tuned binary Selection Agent trained on pairwise comparisons. The approach is complemented by value retrieval via LSH-based keyword extraction and a query fixer for iterative corrections, forming an ensemble that significantly outperforms previous methods on BIRD (71+% EX) and Spider (87.6% EX) benchmarks. Key contributions include a robust candidate-generation suite, an effective fixer, and a pairwise selection mechanism that surpasses self-consistency baselines, achieving state-of-the-art results at submission. The results demonstrate the value of test-time computation and ensemble reasoning for complex Text-to-SQL tasks, with strong generalization to unseen domains. CHASE-SQL thus provides a practical framework for deploying high-accuracy Text-to-SQL systems in real-world databases.

Abstract

In tackling the challenges of large language model (LLM) performance for Text-to-SQL tasks, we introduce CHASE-SQL, a new framework that uses test-time compute in multi-agent modeling to improve candidate generation and selection. CHASE-SQL leverages LLMs' intrinsic knowledge to generate diverse and high-quality SQL candidates using different LLM generators with: (1) a divide-and-conquer method that decomposes complex queries into manageable sub-queries in a single LLM call; (2) chain-of-thought reasoning based on query execution plans, reflecting the steps a database engine takes during execution; and (3) a unique instance-aware synthetic example generation technique, which offers specific few-shot demonstrations tailored to test questions. To identify the best candidate, a selection agent ranks the candidates through pairwise comparisons with a fine-tuned binary-candidates selection LLM. This selection approach has been demonstrated to be more robust than alternatives. The proposed generators-selector framework not only enhances the quality and diversity of SQL queries but also outperforms previous methods. Overall, our proposed CHASE-SQL achieves state-of-the-art execution accuracy of 73.0% and 73.01% on the test set and development set, respectively, of the notable BIRD Text-to-SQL benchmark, making CHASE-SQL the top submission on the leaderboard (at the time of paper submission).
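The pairwise selection step described in the abstract can be illustrated with a minimal sketch. It assumes a hypothetical comparator `prefers_first(a, b)` standing in for the fine-tuned binary selection LLM. The function name, the choice to score each pair in both orders, and the first-wins tie-breaking are illustrative assumptions, not details taken from the paper.

```python
from itertools import combinations
from typing import Callable, Sequence


def select_by_pairwise_comparison(
    candidates: Sequence[str],
    prefers_first: Callable[[str, str], bool],
) -> str:
    """Return the candidate that wins the most pairwise comparisons.

    `prefers_first(a, b)` is a hypothetical stand-in for a binary
    selection model: it returns True if `a` is judged better than `b`.
    """
    if not candidates:
        raise ValueError("at least one candidate is required")

    scores = [0] * len(candidates)
    for i, j in combinations(range(len(candidates)), 2):
        # Assumption: query both orderings so that a comparator with
        # position bias does not systematically favor one slot.
        for a, b in ((i, j), (j, i)):
            winner = a if prefers_first(candidates[a], candidates[b]) else b
            scores[winner] += 1

    # Ties are broken in favor of the earliest candidate.
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best]


if __name__ == "__main__":
    # Toy comparator: a fixed quality table replaces the real model.
    quality = {"SELECT 1": 1, "SELECT 2": 3, "SELECT 3": 2}
    chosen = select_by_pairwise_comparison(
        list(quality), lambda a, b: quality[a] >= quality[b]
    )
    print(chosen)
```

Unlike a self-consistency vote, which picks the most frequent execution result, this scheme can still pick a correct candidate that appears only once, provided the comparator ranks it above the others.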

Paper Structure (44 sections, 1 equation, 25 figures, 8 tables, 3 algorithms)

Figures (25)

  • Figure 1: Overview of the proposed CHASE-SQL framework for Text-to-SQL, with value retrieval, a selection agent that improves the choice of the final answer among the generated candidates, and a fixer that provides feedback for refining the outputs.
  • Figure 2: Comparison of the upper- and lower-bound performance of different candidate generators.
  • Figure 3: Comparison of SQL generation methods: Venn diagram showing unique and overlapping correct answers (left) and the performance across different complexity levels (right).
  • Figure 4: Number of correct queries by each method across different databases of BIRD development set.
  • Figure 5: Distribution of system performance based on final answer correctness. The chart shows the proportion of correct final answers, cases where a correct query exists among the candidates but was not chosen (wrong selection), cases with no correct candidate, and cases where the golden SQL query is wrong.
  • ...and 20 more figures