RetrySQL: text-to-SQL training with retry data for self-correcting query generation

Alicja Rączkowska; Riccardo Belluzzo; Piotr Zieliński; Joanna Baran; Paweł Olszewski

TL;DR

RetrySQL introduces a self-correcting training paradigm for text-to-SQL. Training data is augmented with reasoning steps and with retry data, in which an incorrect step is followed by a [BACK] token and the corrected step, to teach the model to backtrack during generation. The approach uses GPT-4o to generate reasoning chains and then continuously pre-trains open-source coding LLMs on the augmented examples. Compared to pre-training without retry data, this improves execution accuracy (EX) on BIRD and SPIDER by up to 4 percentage points. Supervised fine-tuning with LoRA does not produce the effect, so full-parameter pre-training is required. In end-to-end text-to-SQL pipelines, RetrySQL-trained 1.5B-parameter models are competitive with much larger proprietary models, which makes the approach practical for SQL-oriented generation. The work also analyzes model confidence around [BACK] tokens as evidence of learned self-correction, and discusses limitations and avenues for future research.
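For reference, execution accuracy counts a predicted query as correct when executing it returns the same result as the reference query. The following is a minimal sketch of that idea on SQLite, not the official BIRD or SPIDER evaluator. The function names are hypothetical, and comparing rows as unordered multisets is an assumption, since the benchmark scripts differ in how they treat row order and duplicates.

```python
import sqlite3
from collections import Counter


def execution_match(conn, predicted_sql, reference_sql):
    """Return True if both queries yield the same rows (order ignored).

    Assumption: a predicted query that fails to execute counts as a miss.
    """
    try:
        predicted_rows = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False
    reference_rows = conn.execute(reference_sql).fetchall()
    # Rows are tuples, so they can be counted as a multiset.
    return Counter(predicted_rows) == Counter(reference_rows)


def execution_accuracy(conn, query_pairs):
    """Percentage of (predicted, reference) pairs whose results match."""
    hits = sum(execution_match(conn, pred, ref) for pred, ref in query_pairs)
    return 100.0 * hits / len(query_pairs)
```

Because the score is a percentage over examples, a gain of 4 percentage points means that 4 more queries out of every 100 return the reference result.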

Abstract

The text-to-SQL task is an active challenge in Natural Language Processing. Many existing solutions focus on using black-box language models extended with specialized components within customized end-to-end text-to-SQL pipelines. While these solutions use both closed-source proprietary language models and coding-oriented open-source models, there is a lack of research on SQL-specific generative models. At the same time, recent advances in self-correcting generation strategies show promise for improving the capabilities of existing architectures, but the application of these concepts to the text-to-SQL task remains unexplored. In this paper, we introduce RetrySQL, a new approach to training text-to-SQL generation models. We prepare reasoning steps for reference SQL queries and then corrupt them to create retry data that contains both incorrect and corrected steps, separated by a special token. We continuously pre-train an open-source coding model with this data and demonstrate that retry steps yield an improvement of up to 4 percentage points in both overall and challenging execution accuracy metrics, compared to pre-training without retry data. Additionally, we confirm that supervised fine-tuning with LoRA is ineffective for learning from retry data and that full-parameter pre-training is necessary for that task. We show that the model learns the self-correcting behavior and that the increase in downstream accuracy metrics results from this additional skill. Finally, we incorporate RetrySQL-trained models into the full text-to-SQL pipeline and show that they are competitive in terms of execution accuracy with proprietary models that contain orders of magnitude more parameters. RetrySQL demonstrates that self-correction can be learned in the text-to-SQL task and provides a novel way of improving generation accuracy for SQL-oriented language models.
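To make the shape of retry data concrete, the sketch below interleaves incorrect steps with their corrections, separated by the special token. It is only an illustration under stated assumptions, not the authors' procedure: the abstract does not specify how steps are corrupted, so here an incorrect step is simulated by emitting a later step too early. The function names, the `error_prob` parameter, and that corruption rule are all hypothetical.

```python
import random

BACK = "[BACK]"  # special token that separates an incorrect step from its correction


def make_retry_sequence(steps, error_prob, rng):
    """Build a retry sequence from clean reasoning steps.

    Assumption (not from the source): a corrupted step is a randomly chosen
    later step inserted out of order, followed by [BACK] and the correct step.
    """
    sequence = []
    for index, step in enumerate(steps):
        later_steps = steps[index + 1:]
        if later_steps and rng.random() < error_prob:
            sequence.append(rng.choice(later_steps))  # incorrect step
            sequence.append(BACK)                     # signal to backtrack
        sequence.append(step)                         # corrected step
    return sequence


def strip_retries(sequence):
    """Recover the clean steps by discarding each step that precedes [BACK]."""
    clean = []
    for item in sequence:
        if item == BACK:
            clean.pop()  # drop the incorrect step just emitted
        else:
            clean.append(item)
    return clean
```

With `error_prob=0.0` the sequence equals the clean steps, which corresponds to the baseline of pre-training without retry data. `strip_retries` shows how a generated sequence containing the special token could be reduced to its final, corrected steps.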
