BAPPA: Benchmarking Agents, Plans, and Pipelines for Automated Text-to-SQL Generation

Fahim Ahmed, Md Mubtasim Ahasan, Jahir Sadik Monon, Muntasir Wahed, M Ashraful Amin, A K M Mahbubur Rahman, Amin Ahsan Ali

TL;DR

The paper addresses Text-to-SQL generation with open-source LLMs by introducing three multi-agent pipelines (Multi-Agent Discussion, Planner-Coder, and Coder-Aggregator) and evaluating them across 24 models on the BIRD Mini-Dev and Spider Dev benchmarks. It shows that collaboration and structured reasoning can substantially improve SQL generation quality, especially for small and mid-sized models, and that aggregation and joint planning further improve reliability. The work provides a scalable, open framework that narrows the gap between open models and proprietary systems, which matters for privacy-conscious and resource-constrained deployments. Code and prompts are released to support broader adoption and benchmarking of open-source Text-to-SQL systems.

Abstract

Text-to-SQL systems provide a natural language interface that lets even non-experts access information stored in databases. However, existing Large Language Models (LLMs) struggle to generate SQL from natural language instructions because of large schema sizes and complex reasoning. Prior work often focuses on complex, somewhat impractical pipelines built on flagship models, while smaller, efficient models remain overlooked. In this work, we explore three multi-agent LLM pipelines and systematically benchmark them across a range of small to large open-source models: (1) a Multi-Agent Discussion pipeline, where agents iteratively critique and refine SQL queries and a judge synthesizes the final answer; (2) a Planner-Coder pipeline, where a thinking-model planner generates stepwise SQL generation plans and a coder synthesizes queries; and (3) a Coder-Aggregator pipeline, where multiple coders independently generate SQL queries and a reasoning agent selects the best query. Experiments on the Bird-Bench Mini-Dev set show that Multi-Agent Discussion can improve small-model performance, with an increase of up to 10.6% in Execution Accuracy for Qwen2.5-7b-Instruct after three rounds of discussion. Among the pipelines, the LLM Reasoner-Coder pipeline yields the best results, with DeepSeek-R1-32B and QwQ-32B planners boosting Gemma 3 27B IT accuracy from 52.4% to the highest score of 56.4%. Code is available at https://github.com/treeDweller98/bappa-sql.
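
The abstract reports its results as Execution Accuracy. As a rough illustration of what such a metric measures, the sketch below runs a predicted query and a reference query against the same SQLite database and counts a hit when both return the same set of rows. The function names, the use of SQLite, and the order-insensitive set comparison are assumptions made for this example; they are not taken from the released code.

```python
import sqlite3


def execution_match(conn, predicted_sql, gold_sql):
    """Return True when both queries return the same set of rows.

    Assumption: rows are compared as sets, so ordering and duplicate
    rows are ignored. A predicted query that fails to run is a miss.
    """
    try:
        predicted_rows = set(conn.execute(predicted_sql).fetchall())
    except sqlite3.Error:
        return False
    gold_rows = set(conn.execute(gold_sql).fetchall())
    return predicted_rows == gold_rows


def execution_accuracy(conn, pairs):
    """Percentage of (predicted_sql, gold_sql) pairs whose results match."""
    if not pairs:
        return 0.0
    hits = sum(execution_match(conn, pred, gold) for pred, gold in pairs)
    return 100.0 * hits / len(pairs)
```

Under this definition a query can be written very differently from the reference and still count as correct, as long as it returns the same rows.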

Paper Structure

This paper contains 23 sections, 11 equations, 10 figures, 4 tables.

Figures (10)

  • Figure 1: Overview of Proposed Multi-Agent Pipelines for Text-to-SQL. We propose three pipelines: (i) Multi-agent Discussion, where Discussion Agents iteratively critique and refine each other's responses before a final SQL is selected by a Judge Agent; (ii) Planner-Coder, where a Planner Agent generates a step-by-step outline used by a Coder Agent to synthesize SQL; and (iii) Coder-Aggregator, where multiple Coder Agents generate candidate SQL queries, and an Aggregator Agent selects the final output. All pipelines take a schema and question as input.
  • Figure 2: Zero-shot baseline prompt. The "Let's think step by step" portion is removed for reasoning models according to best practices outlined by model publishers.
  • Figure 3: Zero-shot prompt used by the Starter Agent in the Multi-Agent Discussion pipeline. Each agent, assigned a unique persona, generates an initial SQL query. These starter responses are then reviewed by neighboring agents in the first discussion round.
  • Figure 4: Prompt given to Discussion Agents in the Multi-Agent Discussion pipeline. Each agent considers the responses of others across multiple rounds to refine its own SQL query. This example shows the prompt for Discussion Agent 2, incorporating responses from Agents 1 and 3. Final responses from all agents are passed to the Judge Agent (Figure 5) to generate the final SQL query.
  • Figure 5: Prompt given to the Judge Agent in the Multi-Agent Discussion pipeline. The Judge Agent reviews the outputs of all Discussion Agents (Figure 4) after each round and produces the final SQL query.
  • ...and 5 more figures
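
The control flow of the three pipelines described in the Figure 1 caption can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: each agent is modeled as a hypothetical callable that takes a prompt string and returns text, the prompt wording is invented for the example, and the judge is called once after the last discussion round.

```python
from typing import Callable, List

# Hypothetical agent interface: prompt text in, model response out.
Agent = Callable[[str], str]


def _base_prompt(schema: str, question: str) -> str:
    # All pipelines take a schema and a question as input.
    return f"Schema:\n{schema}\nQuestion: {question}\n"


def multi_agent_discussion(schema: str, question: str, agents: List[Agent],
                           judge: Agent, rounds: int = 3) -> str:
    """Agents answer, refine against each other's answers, then a judge decides."""
    base = _base_prompt(schema, question)
    # Starter round: every agent answers independently.
    answers = [agent(base) for agent in agents]
    for _ in range(rounds):
        revised = []
        for i, agent in enumerate(agents):
            others = [a for j, a in enumerate(answers) if j != i]
            prompt = (base + "Your previous SQL:\n" + answers[i]
                      + "\nOther agents' SQL:\n" + "\n".join(others)
                      + "\nCritique these and return a refined SQL query.")
            revised.append(agent(prompt))
        answers = revised
    return judge(base + "Candidate SQL:\n" + "\n".join(answers)
                 + "\nReturn the final SQL query.")


def planner_coder(schema: str, question: str, planner: Agent, coder: Agent) -> str:
    """A planner writes a step-by-step outline; a coder turns it into SQL."""
    base = _base_prompt(schema, question)
    plan = planner(base + "Write a step-by-step plan. Do not write SQL.")
    return coder(base + "Plan:\n" + plan + "\nWrite the SQL query.")


def coder_aggregator(schema: str, question: str, coders: List[Agent],
                     aggregator: Agent) -> str:
    """Several coders propose SQL independently; an aggregator picks one."""
    base = _base_prompt(schema, question)
    candidates = [coder(base) for coder in coders]
    return aggregator(base + "Candidate SQL:\n" + "\n".join(candidates)
                      + "\nSelect the best SQL query.")
```

In practice each callable would wrap a call to a served model. Keeping the agents as plain callables makes it easy to swap models per role, which is the kind of comparison the benchmark is built around.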