$R^3$: "This is My SQL, Are You With Me?" A Consensus-Based Multi-Agent System for Text-to-SQL Tasks

Hanchen Xia; Feng Jiang; Naihao Deng; Cunxiang Wang; Guojiang Zhao; Rada Mihalcea; Yue Zhang

TL;DR

Text-to-SQL remains challenging because of grounding and reasoning gaps in LLMs. $R^3$ introduces a consensus-based multi-agent framework in which a SQL-writer and a set of diverse reviewer agents iteratively generate and refine SQL, using execution results and a memory-enabled dialogue, aided by Program of Thought prompts and $k$-shot learning. The approach achieves strong results on the Spider and Bird benchmarks: with $R^3$, the open-source Llama-3-8B outperforms GPT-3.5 on the Spider development set, and the system surpasses prior single-LLM and multi-agent baselines. An in-depth error analysis uncovers errors in gold annotations, ambiguous questions, and dirty database values, underscoring the need for refined evaluation protocols in Text-to-SQL research.

Abstract

Large Language Models (LLMs) have demonstrated strong performance on various tasks. To unleash their power on the Text-to-SQL task, we propose $R^3$ (Review-Rebuttal-Revision), a consensus-based multi-agent system for Text-to-SQL tasks. $R^3$ outperforms existing single-LLM Text-to-SQL systems as well as multi-agent Text-to-SQL systems by $1.3\%$ to $8.1\%$ on Spider and Bird. Surprisingly, we find that for Llama-3-8B, $R^3$ outperforms chain-of-thought prompting by over $20\%$, even outperforming GPT-3.5 on the development set of Spider.

Paper Structure (24 sections, 2 equations, 2 figures, 5 tables, 1 algorithm)

Introduction
Architecture
SQL-Writer (SW).
Reviewers (REs).
Overall Architecture.
Experiments and Results
Ablation Studies
Error Analysis
Gold Error.
Ambiguity.
Dirty Database Value.
Conclusion
Appendix
Dataset Descriptions
Baseline
...and 9 more sections

Figures (2)

Figure 1: $\textit{R}^3$ Architecture. $n$ reviewer agents, each with distinct characteristics, are created to review the generated SQL and its execution result. The process continues until the master node (SQL-writer) and the other nodes reach a consensus, at which point the system outputs the final SQL.
Figure 2: $k$-shot Sensitivity Analysis.
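
The review loop described in the Figure 1 caption can be illustrated with a small sketch. This is not the authors' implementation: the names `consensus_loop`, `Review`, and `demo`, the round cap `max_rounds`, and the rule that consensus means "every reviewer approves" are all assumptions made for illustration, and the writer and reviewers are plain Python functions standing in for LLM calls.

```python
import sqlite3
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Review:
    approved: bool
    comment: str = ""


# Hypothetical signatures (assumptions, not from the paper):
# a writer maps (question, feedback memory) to a SQL string;
# a reviewer maps (question, sql, execution result) to a Review.
Writer = Callable[[str, List[str]], str]
Reviewer = Callable[[str, str, str], Review]


def execute(conn: sqlite3.Connection, sql: str) -> str:
    """Run the SQL and return its rows, or the error text, as a string."""
    try:
        return repr(conn.execute(sql).fetchall())
    except sqlite3.Error as exc:
        return f"ERROR: {exc}"


def consensus_loop(conn: sqlite3.Connection, question: str, writer: Writer,
                   reviewers: List[Reviewer], max_rounds: int = 5) -> Tuple[str, int]:
    """Regenerate SQL until all reviewers approve or the round cap is hit."""
    memory: List[str] = []  # reviewer feedback carried across rounds
    sql = ""
    for round_no in range(1, max_rounds + 1):
        sql = writer(question, memory)
        result = execute(conn, sql)
        reviews = [review(question, sql, result) for review in reviewers]
        objections = [r.comment for r in reviews if not r.approved]
        if not objections:  # assumed consensus rule: no reviewer objects
            return sql, round_no
        memory.extend(objections)
    return sql, max_rounds  # no consensus: return the last attempt


def demo() -> Tuple[str, int]:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE singer (name TEXT, age INTEGER)")
    conn.executemany("INSERT INTO singer VALUES (?, ?)", [("Ann", 41), ("Bo", 25)])

    def writer(question: str, memory: List[str]) -> str:
        # Stub: the first draft uses a wrong table name, the revision fixes it.
        if not memory:
            return "SELECT name FROM singers"
        return "SELECT name FROM singer WHERE age > 30"

    def error_reviewer(question: str, sql: str, result: str) -> Review:
        ok = not result.startswith("ERROR")
        return Review(ok, "" if ok else f"Execution failed: {result}")

    def filter_reviewer(question: str, sql: str, result: str) -> Review:
        ok = "WHERE" in sql.upper()
        return Review(ok, "" if ok else "The question asks for a filter on age.")

    return consensus_loop(conn, "Which singers are older than 30?",
                          writer, [error_reviewer, filter_reviewer])
```

In this toy run both reviewers reject the first draft, their comments are stored in `memory`, and the revised query is accepted in the second round. A fuller version would also let the writer rebut a review instead of always revising, which this sketch omits.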
