MSc-SQL: Multi-Sample Critiquing Small Language Models For Text-To-SQL Translation

Satya Krishna Gorti, Ilan Gofman, Zhaoyan Liu, Jiapeng Wu, Noël Vouitsis, Guangwei Yu, Jesse C. Cresswell, Rasa Hosseinzadeh

TL;DR

The paper addresses the need for accessible, privacy-conscious text-to-SQL systems without relying on closed models. It introduces MSc-SQL, a pipeline that combines schema linking, retrieval-augmented SQL generation, and multi-sample critiquing to select the best among several candidate queries from small open-source LLMs. The critiquing component jointly reasons over multiple samples and their execution results, enabling competitive performance on the Spider and BIRD benchmarks at a fraction of GPT-4-based costs. Extensive ablations show that sample diversity and QLoRA-based fine-tuning are key to achieving strong results, with practical implications for latency-sensitive and privacy-preserving applications.

Abstract

Text-to-SQL generation enables non-experts to interact with databases via natural language. Recent advances rely on large closed-source models like GPT-4 that present challenges in accessibility, privacy, and latency. To address these issues, we focus on developing small, efficient, and open-source text-to-SQL models. We demonstrate the benefits of sampling multiple candidate SQL generations and propose our method, MSc-SQL, to critique them using associated metadata. Our sample critiquing model evaluates multiple outputs simultaneously, achieving state-of-the-art performance compared to other open-source models while remaining competitive with larger models at a much lower cost. Full code can be found at https://github.com/layer6ai-labs/msc-sql.

Paper Structure

This paper contains 16 sections, 1 equation, 3 figures, 6 tables, 1 algorithm.

Figures (3)

  • Figure 1: Starting with a natural language query $q$, database schema $\mathcal{S}$, and metadata $\mathcal{M}_{\text{link}}$, the schema linking model returns a subset $\mathcal{S}_{q}$ of tables which are necessary to answer $q$. Next, the SQL generation model adds metadata $\mathcal{M}_{\text{gen}}$ obtained through retrieval against an embedding of the query $e(q)$, and generates multiple possible SQL queries $s_i$. Finally, the multi-sample critiquing model comparatively evaluates the generations $s_i$ along with their execution results $r_i$ when run on the database, and then selects one as the final output $s$.
  • Figure 2: Two example queries sampled from our SQL generation model. Both are given to MSc-SQL for critiquing; one is correct and one is incorrect. Joint reasoning over both queries allows MSc-SQL to better capture the nuanced differences between them and thus select the correct query.
  • Figure 3: Effect of using a different model to create each sample for multi-sample critiquing. The generation models are all fine-tuned Mistral-7B models, each trained with a different random seed.
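
The final stage described in Figure 1 (execute each candidate $s_i$, collect its result $r_i$, then select one output $s$) can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the database is an in-memory SQLite toy, the function names `execute_candidate`, `majority_critic`, and `select_sql` are hypothetical, and the learned critiquing model is replaced by a simple majority vote over execution results, which is a different and much weaker selection rule.

```python
import sqlite3
from collections import Counter
from typing import Callable, List, Tuple

# (ok, payload): payload is the fetched rows on success, or an error message.
Result = Tuple[bool, object]


def execute_candidate(conn: sqlite3.Connection, sql: str) -> Result:
    """Run one candidate query s_i and return its execution result r_i."""
    try:
        return True, conn.execute(sql).fetchall()
    except sqlite3.Error as exc:
        return False, str(exc)


def majority_critic(question: str, candidates: List[str],
                    results: List[Result]) -> int:
    """Hypothetical stand-in for the learned critic (NOT the paper's model).

    Picks the first candidate whose result set is the most common one among
    the candidates that executed without error. The question is unused here;
    a learned critic would read it together with all candidates and results.
    """
    keys = [repr(sorted(payload, key=repr)) if ok else None
            for ok, payload in results]
    counts = Counter(k for k in keys if k is not None)
    if not counts:
        return 0  # every candidate failed; fall back to the first one
    best_key, _ = counts.most_common(1)[0]
    return keys.index(best_key)


def select_sql(conn: sqlite3.Connection, question: str, candidates: List[str],
               critic: Callable[[str, List[str], List[Result]], int]
               = majority_critic) -> str:
    """Execute every candidate, then let the critic see all of them at once."""
    results = [execute_candidate(conn, sql) for sql in candidates]
    return candidates[critic(question, candidates, results)]


# Toy database and candidates, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE singer (id INTEGER PRIMARY KEY, name TEXT, age INTEGER);
    INSERT INTO singer VALUES (1, 'Ann', 30), (2, 'Bo', 41), (3, 'Cy', 25);
""")
candidates = [
    "SELECT name FROM singer WHERE age > 28",
    "SELECT name FROM singer WHERE age >= 29",
    "SELECT nam FROM singer WHERE age > 28",  # invalid column, fails to run
]
chosen = select_sql(conn, "Which singers are older than 28?", candidates)
print(chosen)
```

The point the sketch tries to preserve is the interface: the critic receives all candidates and their execution results together rather than scoring each in isolation, so any callable with the same signature could be passed as `critic` in place of the majority vote.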