
$R^3$: "This is My SQL, Are You With Me?" A Consensus-Based Multi-Agent System for Text-to-SQL Tasks

Hanchen Xia, Feng Jiang, Naihao Deng, Cunxiang Wang, Guojiang Zhao, Rada Mihalcea, Yue Zhang

TL;DR

Text-to-SQL remains challenging because of grounding and reasoning gaps in LLMs. $R^3$ is a consensus-based multi-agent framework in which a SQL-writer and several diverse reviewer agents iteratively generate and refine SQL, drawing on execution results and a memory-enabled dialogue, and aided by Program-of-Thought prompts and $k$-shot learning. The approach achieves strong results on the Spider and Bird benchmarks: it surpasses prior single-LLM and multi-agent baselines by $1.3\%$ to $8.1\%$, and open-source Llama-3-8B with $R^3$ outperforms GPT-3.5 on the Spider-Dev set. An in-depth error analysis uncovers issues in gold annotations, question ambiguity, and dirty database values, underscoring the need for refined evaluation protocols in Text-to-SQL research.

Abstract

Large Language Models (LLMs) have demonstrated strong performance on various tasks. To unleash their power on the Text-to-SQL task, we propose $R^3$ (Review-Rebuttal-Revision), a consensus-based multi-agent system for Text-to-SQL tasks. $R^3$ outperforms the existing single-LLM Text-to-SQL systems as well as the multi-agent Text-to-SQL systems by $1.3\%$ to $8.1\%$ on Spider and Bird. Surprisingly, we find that for Llama-3-8B, $R^3$ outperforms chain-of-thought prompting by over $20\%$, even outperforming GPT-3.5 on the development set of Spider.

Paper Structure (24 sections, 2 equations, 2 figures, 5 tables, 1 algorithm)

Figures (2)

  • Figure 1: $R^3$ Architecture. $n$ reviewer agents, each with distinct characteristics, are created to review the generated SQL and its execution result. The process continues until the master node (SQL-writer) and the other nodes reach a consensus, at which point the system outputs the final SQL.
  • Figure 2: $k$-shot Sensitivity Analysis.
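
The loop described in the Figure 1 caption (a SQL-writer proposes a query, reviewers inspect the query and its execution result, and the process repeats until everyone agrees) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the agents are plain Python functions standing in for LLM calls, and every name here (`consensus_loop`, `toy_writer`, `error_reviewer`, `empty_reviewer`, `max_rounds`) is hypothetical. The sketch also simplifies the rebuttal step by having the writer always revise when any reviewer objects.

```python
import sqlite3
from typing import Callable, List, Tuple

# Hypothetical agent signatures; in a real system each would wrap an LLM call.
Writer = Callable[[str, List[str]], str]                 # (question, memory) -> SQL
Reviewer = Callable[[str, str, str], Tuple[bool, str]]   # (question, SQL, result) -> (agrees, comment)


def execute(conn: sqlite3.Connection, sql: str) -> str:
    """Run the SQL and return the rows (or the error message) as text."""
    try:
        return repr(conn.execute(sql).fetchall())
    except sqlite3.Error as exc:
        return f"ERROR: {exc}"


def consensus_loop(question: str, conn: sqlite3.Connection, writer: Writer,
                   reviewers: List[Reviewer], max_rounds: int = 5) -> str:
    """Iterate write -> execute -> review until no reviewer objects.

    max_rounds is an assumed safety cap, not a value taken from the paper.
    """
    memory: List[str] = []           # dialogue history shared across rounds
    sql = writer(question, memory)
    for _ in range(max_rounds):
        result = execute(conn, sql)
        verdicts = [review(question, sql, result) for review in reviewers]
        objections = [comment for agrees, comment in verdicts if not agrees]
        if not objections:           # consensus reached
            return sql
        memory.extend(objections)    # remember the feedback
        sql = writer(question, memory)
    return sql                       # fall back to the last attempt


# Toy stand-ins for the agents, for demonstration only.
def toy_writer(question: str, memory: List[str]) -> str:
    if not memory:
        return "SELECT COUNT(*) FROM singers"   # deliberately wrong table name
    return "SELECT COUNT(*) FROM singer"        # revised after feedback


def error_reviewer(question: str, sql: str, result: str) -> Tuple[bool, str]:
    if result.startswith("ERROR"):
        return False, f"Execution failed: {result}"
    return True, ""


def empty_reviewer(question: str, sql: str, result: str) -> Tuple[bool, str]:
    if result == "[]":
        return False, "The query returned no rows."
    return True, ""


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO singer VALUES (?, ?)", [(1, "A"), (2, "B")])

final_sql = consensus_loop("How many singers are there?", conn,
                           toy_writer, [error_reviewer, empty_reviewer])
print(final_sql)
```

In this toy run the first query fails on execution, one reviewer objects, and the writer's revised query is accepted by both reviewers in the second round. Giving each reviewer a different checking rule loosely mirrors the "distinct characteristics" mentioned in the caption.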