RetrySQL: text-to-SQL training with retry data for self-correcting query generation

Alicja Rączkowska, Riccardo Belluzzo, Piotr Zieliński, Joanna Baran, Paweł Olszewski

TL;DR

RetrySQL introduces a self-correcting training paradigm for text-to-SQL by augmenting training data with reasoning steps and retry data, where steps are corrupted with [BACK] tokens to teach backtracking during generation. The approach uses GPT-4o to generate reasoning chains, then pre-trains open-source coding LLMs with these augmented examples, achieving improvements of up to 4 percentage points in execution accuracy (EX) on BIRD and SPIDER, and showing that full-parameter pre-training is required for the effect. In end-to-end text-to-SQL pipelines, RetrySQL-trained 1.5B-parameter models are competitive with much larger proprietary models, demonstrating practical viability for SQL-oriented generation. The work also analyzes model confidence around [BACK] tokens, evidencing learned self-correction, and discusses limitations and avenues for future research.

Abstract

The text-to-SQL task is an active challenge in Natural Language Processing. Many existing solutions focus on using black-box language models extended with specialized components within customized end-to-end text-to-SQL pipelines. While these solutions use both closed-source proprietary language models and coding-oriented open-source models, there is a lack of research regarding SQL-specific generative models. At the same time, recent advancements in self-correcting generation strategies show promise for improving the capabilities of existing architectures. The application of these concepts to the text-to-SQL task remains unexplored. In this paper, we introduce RetrySQL, a new approach to training text-to-SQL generation models. We prepare reasoning steps for reference SQL queries and then corrupt them to create retry data that contains both incorrect and corrected steps, divided with a special token. We continuously pre-train an open-source coding model with this data and demonstrate that retry steps yield an improvement of up to 4 percentage points in both overall and challenging execution accuracy metrics, compared to pre-training without retry data. Additionally, we confirm that supervised fine-tuning with LoRA is ineffective for learning from retry data and that full-parameter pre-training is a necessary requirement for that task. We showcase that the self-correcting behavior is learned by the model and the increase in downstream accuracy metrics is a result of this additional skill. Finally, we incorporate RetrySQL-trained models into the full text-to-SQL pipeline and showcase that they are competitive in terms of execution accuracy with proprietary models that contain orders of magnitude more parameters. RetrySQL demonstrates that self-correction can be learned in the text-to-SQL task and provides a novel way of improving generation accuracy for SQL-oriented language models.
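
The retry-data construction described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `error_prob` parameter, and the choice to sample the erroneous step from the other steps of the same solution are all assumptions made for illustration. The only parts taken from the text are the overall shape (an incorrect step, then the special token, then the corrected step) and the [BACK] token itself.

```python
import random

BACK = "[BACK]"  # special token separating an incorrect step from its correction


def make_retry_steps(steps, error_prob=0.2, rng=None):
    """Inject random erroneous steps into a list of reasoning steps.

    Hypothetical helper: before each correct step, with probability
    `error_prob`, a different step from the same solution is inserted
    as a mistake and followed by the BACK token.
    """
    if rng is None:
        rng = random.Random()
    out = []
    for i, step in enumerate(steps):
        # Candidate "mistakes": any other step that differs from the correct one.
        wrong = [s for j, s in enumerate(steps) if j != i and s != step]
        if wrong and rng.random() < error_prob:
            out.append(rng.choice(wrong))  # the incorrect step
            out.append(BACK)               # signal to backtrack
        out.append(step)                   # the corrected step
    return out


def strip_retries(sequence):
    """Recover the clean reasoning steps by undoing every retry."""
    clean = []
    for item in sequence:
        if item == BACK:
            clean.pop()  # discard the step immediately before BACK
        else:
            clean.append(item)
    return clean


if __name__ == "__main__":
    steps = [
        "FROM orders",
        "JOIN customers ON orders.customer_id = customers.id",
        "WHERE customers.country = 'PL'",
        "SELECT COUNT(*)",
    ]
    for item in make_retry_steps(steps, error_prob=0.5, rng=random.Random(0)):
        print(item)
```

Because every injected error is immediately followed by [BACK] and the correct step, removing each step that precedes a [BACK] recovers the original reasoning chain, which is the property `strip_retries` checks.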

Paper Structure

This paper contains 33 sections, 14 figures, 6 tables.

Figures (14)

  • Figure 1: RetrySQL overview. (a) Reasoning step generation. For each SQL query in the training dataset, we generate a series of reasoning steps using GPT-4o. (b) Preparation of retry data. For each set of reasoning steps, we apply random perturbations, treated as errors, by replacing some steps with different ones. We follow these errors with special [BACK] tokens and amend them with correct steps. (c) We take an open-source LLM and continue its pre-training with training examples that contain retry data injected into reasoning steps. The resulting RetrySQL-trained model learns the ability to self-correct, which improves its capabilities in generating correct SQL queries from natural language questions.
  • Figure 2: t-SNE projection of OpenCoder's internal state embeddings for the linear probing task. Blue points represent embeddings corresponding to correct reasoning steps, while orange points indicate embeddings for incorrect steps. The clusters of orange points indicate that the OpenCoder model differentiates a large portion of the incorrect steps from the correct ones, highlighting the innate, yet hidden, ability to detect mistakes in the reasoning process.
  • Figure 3: Distribution of token confidence before and after [BACK] tokens. (a) Mean of max token confidence across 10 beam search passes. It can be seen that the confidence score is on average much higher for tokens after the [BACK] token, indicating that the model is uncertain as it makes mistakes, but is confident after self-correction. (b) Standard deviation of max token confidence across 10 beam search passes. The variance of model predictions is much higher as it makes mistakes than after self-correction.
  • Figure S1: Distribution of SQL query complexity in the BIRD development dataset. There is a significant overlap in the number of level-1 expressions in the SQL syntax tree across difficulty levels defined in the BIRD development dataset. Due to our reasoning generation strategy (see the paragraph on reasoning data), the number of these expressions is a proxy for the number of reasoning steps. We parsed the ground truth SQL queries with the SQLGlot Python library and extracted level-1 elements from the corresponding syntax trees.
  • Figure S2: Example of a training data sample used in our experiments. It consists of the following elements: DDL statements for schema representation, external knowledge and question extracted from the BIRD metadata, reasoning steps generated as described in the paragraph on reasoning data, and the ground truth SQL query. These components are separated by special tokens to guide the model in the learning process.
  • ...and 9 more figures
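
The confidence comparison summarized in the Figure 3 caption can be sketched as follows. This is an illustrative assumption about the aggregation, not the authors' analysis code: it pools the maximum token probabilities from all generation passes and reports the mean and standard deviation before and after the first [BACK] token. The function names and the input layout (one list of tokens and one list of confidences per pass) are hypothetical.

```python
from statistics import mean, pstdev

BACK = "[BACK]"


def split_confidences(tokens, confidences):
    """Split per-token confidences at the first BACK token (BACK itself excluded)."""
    idx = tokens.index(BACK)
    return confidences[:idx], confidences[idx + 1:]


def summarize(passes):
    """Pool confidences over passes and report (mean, std) before and after BACK.

    `passes` is a list of (tokens, confidences) pairs, one per generation pass;
    each confidence is the maximum token probability at that position.
    Passes that contain no BACK token are skipped.
    """
    before, after = [], []
    for tokens, confidences in passes:
        if BACK not in tokens:
            continue
        b, a = split_confidences(tokens, confidences)
        before.extend(b)
        after.extend(a)
    return {
        "before": (mean(before), pstdev(before)),
        "after": (mean(after), pstdev(after)),
    }


if __name__ == "__main__":
    # Toy numbers, invented purely to exercise the functions.
    toy_passes = [
        (["WHERE", "x", BACK, "SELECT", "y"], [0.4, 0.6, 0.9, 0.9, 0.9]),
        (["SELECT", "y"], [0.8, 0.8]),  # no BACK token: ignored
    ]
    print(summarize(toy_passes))
```

Under this reading, a higher mean and lower standard deviation in the "after" group would correspond to the pattern the caption describes: uncertainty while the mistake is being made, and confidence after self-correction.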