
SWE-SQL: Illuminating LLM Pathways to Solve User SQL Issues in Real-World Applications

Jinyang Li, Xiaolong Li, Ge Qu, Per Jacobsson, Bowen Qin, Binyuan Hui, Shuzheng Si, Nan Huo, Xiaohan Xu, Yue Zhang, Ziwei Tang, Yuanshuai Li, Florensia Widjaja, Xintong Zhu, Feige Zhou, Yongfeng Huang, Yannis Papakonstantinou, Fatma Ozcan, Chenhao Ma, Reynold Cheng

TL;DR

The paper tackles the real-world problem of debugging SQL queries by introducing BIRD-CRITIC, the first benchmark for SQL issue debugging, with 530 PostgreSQL tasks and 570 multi-dialect tasks derived from authentic user issues. It also introduces Six-Gym, an automated training environment that uses SQL-Rewind to generate executable issue-solution data. It then presents Bird-Fixer, an open-source SQL debugging agent built on the SQL-Act scaffold and enhanced by f-Plan Boosting and Generative Thought Mode, which improve trajectory quality and generalization across dialects; the agent is competitive with proprietary systems on both PostgreSQL and multi-dialect tasks. The work provides a scalable, privacy-preserving pipeline for evaluating and training LLMs in realistic SQL debugging scenarios, and offers ablations and an error taxonomy to guide future improvements. Together, these contributions move open-source models closer to industrial-grade SQL debugging tools and toward robust, cross-dialect, governance-friendly database assistance.

Abstract

Resolution of complex SQL issues persists as a significant bottleneck in real-world database applications. Current Large Language Models (LLMs), while adept at text-to-SQL translation, have not been rigorously evaluated on the more challenging task of debugging SQL issues. To address this gap, we introduce BIRD-CRITIC, a new SQL issue debugging benchmark comprising 530 PostgreSQL tasks (BIRD-CRITIC-PG) and 570 multi-dialect tasks (BIRD-CRITIC-Multi), distilled from authentic user issues and replayed within new environments to facilitate rigorous evaluation. Baseline evaluations underscore the task's complexity, with the leading reasoning model O3-Mini achieving only a 38.87% success rate on BIRD-CRITIC-PG and 33.33% on BIRD-CRITIC-Multi. Meanwhile, advancing open-source models for database tasks is crucial for empowering local development while safeguarding data privacy. Therefore, we present Six-Gym (Sql-fIX-Gym), a training environment for elevating open-source model capabilities for SQL issue debugging. This environment leverages the SQL-Rewind strategy, which automatically generates executable issue-solution datasets by reverse-engineering issues from verified SQL queries. However, popular trajectory-based fine-tuning methods do not explore substantial supervisory signals. We further propose f-Plan Boosting, which extracts high-level debugging plans from SQL solutions, enabling teacher LLMs to produce 73.7% more successful trajectories for training. We integrate these components into an open-source agent, Bird-Fixer. Based on Qwen-2.5-Coder-14B, Bird-Fixer achieves a 38.11% success rate on BIRD-CRITIC-PG and 29.65% on BIRD-CRITIC-Multi, surpassing leading proprietary models such as Claude-3.7-Sonnet and GPT-4.1, marking a significant step toward democratizing sophisticated SQL-debugging capabilities. The leaderboard and source code are available at https://bird-critic.github.io/
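To make the task concrete, the following is a minimal sketch of what one SQL issue debugging check might look like. It is not the benchmark's evaluation harness: it assumes that a fix counts as successful when it executes without error and returns the same rows as a reference solution, and it uses an in-memory SQLite database instead of PostgreSQL so that it runs without setup. The `orders` table, the three queries, and the helper names `run_query` and `is_resolved` are all hypothetical.

```python
import sqlite3


def run_query(conn, sql):
    """Run sql; return (sorted rows, None) on success or (None, error text) on failure."""
    try:
        return sorted(conn.execute(sql).fetchall()), None
    except sqlite3.Error as exc:
        return None, str(exc)


def is_resolved(conn, candidate_sql, reference_sql):
    """Assumed success criterion: both queries run and return identical rows."""
    cand_rows, cand_err = run_query(conn, candidate_sql)
    ref_rows, ref_err = run_query(conn, reference_sql)
    return cand_err is None and ref_err is None and cand_rows == ref_rows


# Hypothetical toy database standing in for a benchmark environment.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL);
    INSERT INTO orders (customer, amount)
    VALUES ('ann', 10.0), ('ann', 5.0), ('bob', 7.5);
    """
)

# Issue SQL: an aggregate used in WHERE, a common user mistake.
issue_sql = (
    "SELECT customer, SUM(amount) FROM orders "
    "WHERE SUM(amount) > 8 GROUP BY customer"
)
# Candidate fix: move the aggregate condition into HAVING.
fixed_sql = (
    "SELECT customer, SUM(amount) FROM orders "
    "GROUP BY customer HAVING SUM(amount) > 8"
)
# Reference solution, written slightly differently but equivalent in result.
reference_sql = (
    "SELECT customer, SUM(amount) AS total FROM orders "
    "GROUP BY customer HAVING SUM(amount) > 8"
)

print(is_resolved(conn, issue_sql, reference_sql))
print(is_resolved(conn, fixed_sql, reference_sql))
```

Comparing executed results rather than SQL text lets differently written but equivalent fixes pass, which is one plausible reason to evaluate debugging by execution; the criteria actually used by BIRD-CRITIC may be stricter or task-specific.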

Paper Structure

This paper contains 64 sections, 2 equations, 10 figures, 10 tables, 2 algorithms.

Figures (10)

  • Figure 1: Illustration of the SQL issue debugging process in BIRD-CRITIC. Starting from a user issue description (left) and the issue SQL query (center-left), LLMs produce a corrected SQL solution (right) through reasoning and interaction with the environment.
  • Figure 2: Example task structure within the BIRD-CRITIC benchmark, demonstrating the transformation from a user-reported issue and error SQL to a revised SQL solution.
  • Figure 3: Distribution of issue categories across BIRD-CRITIC, derived from an analysis of SQL usage in real-world database applications. A detailed distribution is given in the appendix.
  • Figure 4: LLM agent performance on BIRD-CRITIC-PG. ToolAct employs a constrained toolkit as actions, while SQLAct executes SQL queries as actions.
  • Figure 5: Success Rate vs Query Diversity (by Issue Category). It shows a strong negative correlation ($r = -0.89$) between token n-gram diversity and model performance after normalization.
  • ...and 5 more figures