ReViSQL: Achieving Human-Level Text-to-SQL

Yuxuan Zhu; Tengjun Jin; Yoojin Choi; Daniel Kang

Abstract

Translating natural language to SQL (Text-to-SQL) is a critical challenge in both database research and data analytics applications. Recent efforts have focused on enhancing SQL reasoning by developing large language models and AI agents that decompose Text-to-SQL tasks into manually designed, step-by-step pipelines. Despite these extensive architectural engineering efforts, a significant gap remains: even state-of-the-art (SOTA) AI agents have not yet achieved human-level accuracy on the BIRD benchmark. In this paper, we show that closing this gap does not require further architectural complexity, but rather clean training data that improves the SQL reasoning of the underlying models. We introduce ReViSQL, a streamlined framework that achieves human-level accuracy on BIRD for the first time. Instead of relying on complex AI agents, ReViSQL applies reinforcement learning with verifiable rewards (RLVR) to BIRD-Verified, a dataset we curated comprising 2.5k verified Text-to-SQL instances based on the BIRD Train set. To construct BIRD-Verified, we design a data correction and verification workflow involving SQL experts. We identified and corrected data errors in 61.1% of a subset of BIRD Train. By training on BIRD-Verified, we show that improving data quality alone boosts single-generation accuracy by 8.2-13.9% under the same RLVR algorithm. To further enhance performance, ReViSQL performs inference-time scaling via execution-based reconciliation and majority voting. Empirically, we demonstrate the superiority of our framework at two model scales: ReViSQL-235B-A22B and ReViSQL-30B-A3B. On an expert-verified BIRD Mini-Dev set, ReViSQL-235B-A22B achieves 93.2% execution accuracy, exceeding the proxy human-level accuracy (92.96%) and outperforming the prior open-source SOTA method by 9.8%. Our lightweight ReViSQL-30B-A3B matches the prior SOTA at a 7.5$\times$ lower per-query cost.
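To make the idea of a verifiable reward concrete, the following is a minimal sketch of an execution-based reward of the kind RLVR relies on: a predicted query earns a reward only if it runs and returns the same rows as the gold query. This is an illustration, not the paper's implementation. The function names (`run_query`, `execution_reward`), the binary 0/1 reward, the order-insensitive multiset comparison, and the toy `singer` table are all assumptions introduced here.

```python
import sqlite3
from collections import Counter


def run_query(conn, sql):
    """Run a query and return its rows as an order-insensitive multiset.

    Returns None if the query fails to execute. (Hypothetical helper.)
    """
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    return Counter(rows)


def execution_reward(conn, predicted_sql, gold_sql):
    """Assumed binary reward: 1.0 if both queries run and return the same rows."""
    gold = run_query(conn, gold_sql)
    pred = run_query(conn, predicted_sql)
    if gold is None or pred is None:
        return 0.0
    return 1.0 if pred == gold else 0.0


# Toy in-memory database, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE singer (name TEXT, age INTEGER);
    INSERT INTO singer VALUES ('Ann', 31), ('Bo', 45), ('Cy', 27);
    """
)

gold = "SELECT name FROM singer WHERE age > 30"
reward_match = execution_reward(conn, "SELECT name FROM singer WHERE age >= 31", gold)
reward_wrong = execution_reward(conn, "SELECT name FROM singer WHERE age < 30", gold)
reward_error = execution_reward(conn, "SELEC name FROM singer", gold)
```

Because the reward depends entirely on the gold query's execution result, an incorrect gold query would reward the wrong behavior, which is why the quality of the verified training data matters for this kind of training signal.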

Paper Structure (18 sections, 3 equations, 8 figures, 7 tables, 1 algorithm)

Introduction
Related Work
Text-to-SQL Methods
Noise in Text-to-SQL Datasets
Overview
BIRD-Verified: A Verified Dataset for Text-to-SQL RLVR
Data Curation
Statistics
The ReViSQL Framework
Training with Verified Data
Inference-time Scaling with Reconciliation
Evaluation
Experimental Setup
ReViSQL Achieves Human-parity and Outperforms Open-Source Agents
ReViSQL Outperforms Single-model Baselines
...and 3 more sections

Figures (8)

Figure 1: ReViSQL achieves human-level accuracy on an expert-verified BIRD Mini-Dev set constructed by prior work [jin2026pervasive, Arcwise-minidev]. Compared to the SOTA open-source agent on the BIRD leaderboard [li2023can], ReViSQL-235B-A22B achieves up to 9.8% higher accuracy. ReViSQL-30B-A3B matches the performance of the SOTA agent at 7.5$\times$ lower cost. ReViSQL dominates existing methods at all cost levels.
Figure 2: The end-to-end ReViSQL framework. ReViSQL achieves human-parity Text-to-SQL through three core steps: (a) Training data curation: A rigorous, expert-driven correction and verification pipeline that converts noisy training data into the BIRD-Verified dataset. (b) RLVR training: An open-source LLM generates multiple reasoning rollouts and receives rewards based on execution correctness against the verified gold SQL query, effectively internalizing reasoning capabilities. (c) Inference-time scaling: At inference, the finetuned LLM generates multiple candidate queries, which are grouped by execution result, reconciled against the user's explicit intent using a pre-RLVR base model, and finalized via majority voting.
Figure 3: Quantitative analysis of the BIRD-Verified expert curation process. Relying solely on automated LLMs for data correction is insufficient due to poor recall (Fig. \ref{fig:llm-reviewer-metrics}). Consequently, ReViSQL uses an expert verification pipeline that requires up to four iterative rounds to fully resolve errors (Fig. \ref{fig:veri-conflicts}). We identified and corrected errors across 52.1% of SQL queries, 26.2% of natural language questions, and 18.2% of external knowledge contexts (Fig. \ref{fig:error-types}).
Figure 4: Accuracy of ReViSQL and baseline models under inference-time scaling constraints. Across both BIRD datasets, ReViSQL models consistently achieve significantly higher execution accuracy than all baselines when scaling the number of candidates from 4 to 32 (Fig. \ref{fig:single-sql} and \ref{fig:single-full}). Furthermore, in terms of cost-efficiency, ReViSQL establishes a strict new Pareto frontier, delivering higher accuracy at substantially lower costs compared to expensive models like GPT-5.2 (Fig. \ref{fig:single-sql-cost} and \ref{fig:single-full-cost}).
Figure 5: BIRD-Verified prevents spurious reward optimization during RLVR training. While RLVR successfully drives up the training rewards for both BIRD-Verified and the original BIRD Train set (Fig. \ref{fig:ablation-rewards}), this optimization only translates to test accuracy improvement on verified data (Fig. \ref{fig:ablation-test}).
...and 3 more figures
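
The inference-time step in Figure 2(c) can be illustrated with a small sketch that groups candidate queries by their execution result and picks a representative from the largest group. This is a simplified illustration under stated assumptions, not the paper's algorithm: it omits the reconciliation step that uses a pre-RLVR base model, and the names `result_key` and `majority_vote`, the tie-breaking rule, and the toy `singer` table are hypothetical.

```python
import sqlite3
from collections import defaultdict


def result_key(conn, sql):
    """Return a hashable, order-insensitive key for a query's result.

    Returns None if the query fails to execute. (Hypothetical helper.)
    """
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    # Sort by repr so rows with NULLs or mixed types still order consistently.
    return tuple(sorted(rows, key=repr))


def majority_vote(conn, candidates):
    """Group candidates by execution result and return one from the largest group.

    Ties go to the group seen first (an assumption of this sketch).
    Returns None if no candidate executes successfully.
    """
    groups = defaultdict(list)
    for sql in candidates:
        key = result_key(conn, sql)
        if key is not None:
            groups[key].append(sql)
    if not groups:
        return None
    return max(groups.values(), key=len)[0]


# Toy in-memory database, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE singer (name TEXT, age INTEGER);
    INSERT INTO singer VALUES ('Ann', 31), ('Bo', 45), ('Cy', 27);
    """
)

candidates = [
    "SELECT name FROM singer WHERE age < 30",   # minority answer
    "SELECT name FROM singer WHERE age > 30",   # majority answer
    "SELECT name FROM singer WHERE age >= 31",  # same rows as the previous one
    "SELEC name FROM singer",                   # fails to execute, ignored
]
chosen = majority_vote(conn, candidates)
```

Grouping by execution result rather than by query text lets syntactically different but equivalent queries pool their votes, which is what makes voting meaningful over sampled SQL.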
