Agentar-Scale-SQL: Advancing Text-to-SQL through Orchestrated Test-Time Scaling

Pengfei Wang, Baolin Sun, Xuemei Dong, Yaxun Dai, Hongwei Yuan, Mengdie Chu, Yingqi Gao, Xiang Qi, Peng Zhang, Ying Yan

TL;DR

The paper targets the gap between Text-to-SQL systems and human experts on the BIRD benchmark. It introduces Agentar-Scale-SQL, a general-purpose orchestrated test-time scaling framework that combines RL-enhanced internal reasoning, sequential iterative refinement, and parallel diverse synthesis with tournament-based selection. It reports state-of-the-art results on the BIRD test set (EX 81.67%, R-VES 77.00%) and provides detailed ablations showing that each component plays a critical role. The approach favors scalable, modular design over hand-crafted heuristics, with the aim of handling complex schemas across domains. It also discusses practical limitations such as compute cost and latency, and outlines future directions toward more autonomous, action-based learning in code-generation contexts.

Abstract

State-of-the-art (SOTA) Text-to-SQL methods still lag significantly behind human experts on challenging benchmarks like BIRD. Current approaches that explore test-time scaling lack an orchestrated strategy and neglect the model's internal reasoning process. To bridge this gap, we introduce Agentar-Scale-SQL, a novel framework leveraging scalable computation to improve performance. Agentar-Scale-SQL implements an Orchestrated Test-Time Scaling strategy that synergistically combines three distinct perspectives: i) Internal Scaling via RL-enhanced Intrinsic Reasoning, ii) Sequential Scaling through Iterative Refinement, and iii) Parallel Scaling using Diverse Synthesis and Tournament Selection. Agentar-Scale-SQL is a general-purpose framework designed for easy adaptation to new databases and more powerful language models. Extensive experiments show that Agentar-Scale-SQL achieves SOTA performance on the BIRD benchmark, reaching 81.67% execution accuracy on the test set and ranking first on the official leaderboard, demonstrating an effective path toward human-level performance.
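
To make the parallel-scaling idea concrete, the following is a minimal sketch of tournament-style selection over candidate SQL queries. It is an illustration under stated assumptions, not the paper's implementation: the function names `execute` and `tournament_select`, the grouping of candidates by execution result, the round-robin pairing, and the `judge` callback (a stand-in for whatever selection model compares two candidates) are all hypothetical choices made for this example.

```python
import sqlite3
from collections import defaultdict
from itertools import combinations
from typing import Callable, Hashable, List, Optional


def execute(conn: sqlite3.Connection, sql: str) -> Optional[Hashable]:
    """Run a query; return an order-insensitive result, or None on error."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    return tuple(sorted(rows, key=repr))


def tournament_select(
    conn: sqlite3.Connection,
    candidates: List[str],
    judge: Callable[[str, str], str],
) -> Optional[str]:
    """Pick one candidate via round-robin pairwise comparison.

    `judge(a, b)` is a hypothetical comparator that must return either
    `a` or `b`; in practice it might wrap a learned selection model.
    """
    # Group candidates by execution result and drop those that fail to run.
    groups = defaultdict(list)
    for sql in candidates:
        result = execute(conn, sql)
        if result is not None:
            groups[result].append(sql)
    if not groups:
        return None

    # Keep one representative per distinct result, plus its group size.
    reps = [(sqls[0], len(sqls)) for sqls in groups.values()]
    wins = {sql: 0 for sql, _ in reps}

    # Every representative meets every other one exactly once.
    for (a, _), (b, _) in combinations(reps, 2):
        wins[judge(a, b)] += 1

    # Most wins is selected; ties fall back to the larger group (assumption).
    best_sql, _ = max(reps, key=lambda r: (wins[r[0]], r[1]))
    return best_sql


# Toy usage with an in-memory database and a placeholder judge.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (3,)])
candidates = [
    "SELECT SUM(x) FROM t",
    "SELECT sum(x) FROM t",
    "SELECT COUNT(*) FROM t",
    "SELECT nope FROM t",  # fails to execute and is discarded
]
toy_judge = lambda a, b: a if "sum" in a.lower() else b
print(tournament_select(conn, candidates, toy_judge))
```

Grouping by execution result before comparing keeps the number of pairwise judgments proportional to the number of distinct answers rather than the number of raw candidates, which is one plausible way to limit the compute cost of the selection step.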

Paper Structure

This paper contains 32 sections, 9 equations, 18 figures, 5 tables.

Figures (18)

  • Figure 1: The proposed Agentar-Scale-SQL framework.
  • Figure 2: An example of a database schema represented in both DDL schema and light schema formats.
  • Figure 3: Execution accuracy of voting, selection model, and upper bound across generator components.
  • Figure 4: Shared and unique correct samples between ICL and reasoning generators.
  • Figure 5: Number of correct samples by difficulty level for ICL, reasoning, and combined generators.
  • ...and 13 more figures