Agentar-Scale-SQL: Advancing Text-to-SQL through Orchestrated Test-Time Scaling

Pengfei Wang; Baolin Sun; Xuemei Dong; Yaxun Dai; Hongwei Yuan; Mengdie Chu; Yingqi Gao; Xiang Qi; Peng Zhang; Ying Yan

TL;DR

The paper addresses the gap between Text-to-SQL systems and human experts on the BIRD benchmark by introducing Agentar-Scale-SQL, a general-purpose orchestrated test-time scaling framework. The framework combines internal RL-enhanced reasoning, sequential iterative refinement, and parallel diverse synthesis with tournament-based selection. It reports state-of-the-art results on the BIRD test set (EX 81.67%, R-VES 77.00%) and provides detailed ablations showing the role each component plays. The approach favors a scalable, modular design over hand-crafted heuristics, to better handle complex schemas across domains. The paper also discusses practical limitations such as compute cost and latency, and outlines future directions toward more autonomous, action-based learning in code-generation settings.

Abstract

State-of-the-art (SOTA) Text-to-SQL methods still lag significantly behind human experts on challenging benchmarks like BIRD. Current approaches that explore test-time scaling lack an orchestrated strategy and neglect the model's internal reasoning process. To bridge this gap, we introduce Agentar-Scale-SQL, a novel framework leveraging scalable computation to improve performance. Agentar-Scale-SQL implements an Orchestrated Test-Time Scaling strategy that synergistically combines three distinct perspectives: i) Internal Scaling via RL-enhanced Intrinsic Reasoning, ii) Sequential Scaling through Iterative Refinement, and iii) Parallel Scaling using Diverse Synthesis and Tournament Selection. Agentar-Scale-SQL is a general-purpose framework designed for easy adaptation to new databases and more powerful language models. Extensive experiments show that Agentar-Scale-SQL achieves SOTA performance on the BIRD benchmark, reaching 81.67% execution accuracy on the test set and ranking first on the official leaderboard, demonstrating an effective path toward human-level performance.
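The sequential and parallel perspectives named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the candidate queries, `toy_fixer`, and `toy_judge` are hypothetical stand-ins for model calls, the round-robin tournament is only one possible selection scheme, and internal scaling (RL-enhanced reasoning) happens inside the model and is not represented here. The sketch assumes an in-memory SQLite database.

```python
import sqlite3
from itertools import combinations


def execute(conn, sql):
    """Run a query and return (rows, error); exactly one of the two is None."""
    try:
        rows = conn.execute(sql).fetchall()
        return sorted(rows, key=repr), None
    except sqlite3.Error as exc:
        return None, str(exc)


def refine(conn, sql, fixer, max_rounds=2):
    """Sequential scaling (sketch): ask the fixer for a repair while the query fails."""
    for _ in range(max_rounds):
        _, err = execute(conn, sql)
        if err is None:
            break
        sql = fixer(sql, err)
    return sql


def tournament(conn, candidates, judge):
    """Parallel scaling (sketch): round-robin pairwise comparisons, most wins is kept."""
    wins = {sql: 0 for sql in candidates}
    for a, b in combinations(candidates, 2):
        wins[judge(conn, a, b)] += 1
    return max(candidates, key=lambda sql: wins[sql])


def orchestrate(conn, raw_candidates, fixer, judge):
    """Refine every candidate, drop exact duplicates, then select one by tournament."""
    refined = [refine(conn, sql, fixer) for sql in raw_candidates]
    unique = list(dict.fromkeys(refined))  # order-preserving de-duplication
    return tournament(conn, unique, judge)


# --- Hypothetical stand-ins for model calls (assumptions, not from the paper) ---

def toy_fixer(sql, error_message):
    """Pretend repair step: only knows how to fix one misspelled table name."""
    return sql.replace("singers", "singer")


def toy_judge(conn, a, b):
    """Pretend judge: prefer the query that executes; ties go to the first one."""
    _, err_a = execute(conn, a)
    _, err_b = execute(conn, b)
    return b if (err_a is not None and err_b is None) else a


def build_demo_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE singer (name TEXT, age INTEGER);
        INSERT INTO singer VALUES ('Ann', 30), ('Bob', 45), ('Cy', 45);
        """
    )
    return conn


# Made-up candidates for the question "How many singers are older than 40?"
RAW_CANDIDATES = [
    "SELECT COUNT(*) FROM singers WHERE age > 40",    # wrong table name, fixable
    "SELECT COUNT(*) FROM singer WHERE agee > 40",    # wrong column, toy fixer cannot repair
    "SELECT COUNT(name) FROM singer WHERE age > 40",  # valid as written
]

if __name__ == "__main__":
    conn = build_demo_db()
    best = orchestrate(conn, RAW_CANDIDATES, toy_fixer, toy_judge)
    print(best, execute(conn, best))
```

In a real system the fixer and judge would be model calls with access to the question and schema, and a judge that only checks executability, as here, would be far too weak to separate two valid but semantically different queries.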
