Table of Contents
Fetching ...

Analyzing the Effectiveness of Large Language Models on Text-to-SQL Synthesis

Richard Roberson, Gowtham Kaki, Ashutosh Trivedi

TL;DR

This work evaluates large language models for Text-to-SQL on the Spider benchmark, comparing open-source fine-tuning of WizardCoder with closed-source GPT-3.5/4 approaches. Open models achieve around $61\%$ execution accuracy, while a correction-driven pipeline using fine-tuning plus Example Driven and Error Driven Corrections reaches up to $82.1\%$ execution accuracy. The study identifies seven primary error types (e.g., SELECT correctness, GROUP BY, value prediction, aggregates, JOINs) and reveals substantial inconsistencies and gaps in the Spider dataset that can misclassify correct queries. Overall, the results demonstrate the potential of correction-based strategies to boost Text-to-SQL performance, while highlighting the need for more nuanced evaluation of SQL correctness and dataset quality.

Abstract

This study investigates various approaches to using Large Language Models (LLMs) for Text-to-SQL program synthesis, focusing on the outcomes and insights derived. Employing the popular Text-to-SQL dataset, spider, the goal was to input a natural language question along with the database schema and output the correct SQL SELECT query. The initial approach was to fine-tune a local and open-source model to generate the SELECT query. After QLoRa fine-tuning WizardLM's WizardCoder-15B model on the spider dataset, the execution accuracy for generated queries rose to a high of 61%. With the second approach, using the fine-tuned gpt-3.5-turbo-16k (Few-shot) + gpt-4-turbo (Zero-shot error correction), the execution accuracy reached a high of 82.1%. Of all the incorrect queries, most can be categorized into a seven different categories of what went wrong: selecting the wrong columns or wrong order of columns, grouping by the wrong column, predicting the wrong values in conditionals, using different aggregates than the ground truth, extra or too few JOIN clauses, inconsistencies in the Spider dataset, and lastly completely incorrect query structure. Most if not all of the queries fall into these categories and it is insightful to understanding where the faults still lie with LLM program synthesis and where they can be improved.

Analyzing the Effectiveness of Large Language Models on Text-to-SQL Synthesis

TL;DR

This work evaluates large language models for Text-to-SQL on the Spider benchmark, comparing open-source fine-tuning of WizardCoder with closed-source GPT-3.5/4 approaches. Open models achieve around execution accuracy, while a correction-driven pipeline using fine-tuning plus Example Driven and Error Driven Corrections reaches up to execution accuracy. The study identifies seven primary error types (e.g., SELECT correctness, GROUP BY, value prediction, aggregates, JOINs) and reveals substantial inconsistencies and gaps in the Spider dataset that can misclassify correct queries. Overall, the results demonstrate the potential of correction-based strategies to boost Text-to-SQL performance, while highlighting the need for more nuanced evaluation of SQL correctness and dataset quality.

Abstract

This study investigates various approaches to using Large Language Models (LLMs) for Text-to-SQL program synthesis, focusing on the outcomes and insights derived. Employing the popular Text-to-SQL dataset, spider, the goal was to input a natural language question along with the database schema and output the correct SQL SELECT query. The initial approach was to fine-tune a local and open-source model to generate the SELECT query. After QLoRa fine-tuning WizardLM's WizardCoder-15B model on the spider dataset, the execution accuracy for generated queries rose to a high of 61%. With the second approach, using the fine-tuned gpt-3.5-turbo-16k (Few-shot) + gpt-4-turbo (Zero-shot error correction), the execution accuracy reached a high of 82.1%. Of all the incorrect queries, most can be categorized into a seven different categories of what went wrong: selecting the wrong columns or wrong order of columns, grouping by the wrong column, predicting the wrong values in conditionals, using different aggregates than the ground truth, extra or too few JOIN clauses, inconsistencies in the Spider dataset, and lastly completely incorrect query structure. Most if not all of the queries fall into these categories and it is insightful to understanding where the faults still lie with LLM program synthesis and where they can be improved.
Paper Structure (19 sections, 2 figures, 1 table)

This paper contains 19 sections, 2 figures, 1 table.

Figures (2)

  • Figure 1: Shows the open-source fine-tuning dataset format for a single entry. The bold sections show the spider NL question, the database schema format, and the skeleton format in the response li2023resdsql.
  • Figure 2: Shows the clear context database schema format that OpenAI suggests using for Text-to-SQL program synthesis with their models. The C3 method, zero-shot state-of-the-art, also uses this db format dong2023c3.