OraPlan-SQL: A Planning-Centric Framework for Complex Bilingual NL2SQL Reasoning
Marianne Menglin Liu, Sai Ashish Somayajula, Syed Fahad Allam Shah, Sujith Ravi, Dan Roth
TL;DR
This work tackles bilingual NL2SQL with complex reasoning by introducing OraPlan-SQL, a two-stage framework combining a Planner and a SQL Agent. The key innovations are a feedback-guided meta-prompting pipeline to refine the planner, multilingual entity-linking guidelines to mitigate transliteration and surface-form mismatches, and plan diversification with majority voting to boost robustness. Enriching prompts with schema retrieval and in-context examples improves schema awareness and reduces errors, while diverse plan generation and consensus over execution results yield strong, language-robust performance. On the Archer NL2SQL benchmark, OraPlan-SQL achieves state-of-the-art bilingual execution accuracy (EN = 54.96%, ZH = 56.67%) and over 99% SQL validity, exceeding the second-best system by more than 6% in execution accuracy and narrowing the cross-lingual gap. Overall, the approach demonstrates that planning-centric prompting and targeted multilingual refinements can deliver reliable, scalable NL2SQL systems without heavy multi-agent orchestration.
Abstract
We present OraPlan-SQL, our system for the Archer NL2SQL Evaluation Challenge 2025, a bilingual benchmark requiring complex reasoning such as arithmetic, commonsense, and hypothetical inference. OraPlan-SQL ranked first, exceeding the second-best system by more than 6% in execution accuracy (EX), with 55.0% in English and 56.7% in Chinese, while maintaining over 99% SQL validity (VA). Our system follows an agentic framework with two components: a Planner agent that generates stepwise natural language plans, and a SQL agent that converts these plans into executable SQL. Since the SQL agent reliably adheres to the plan, our refinements focus on the planner. Unlike prior methods that rely on multiple sub-agents for planning and suffer from orchestration overhead, we introduce a feedback-guided meta-prompting strategy to refine a single planner. Failure cases from a held-out set are clustered with human input, and an LLM distills them into corrective guidelines that are integrated into the planner's system prompt, improving generalization without added complexity. For the multilingual scenario, to address transliteration and entity mismatch issues, we incorporate entity-linking guidelines that generate alternative surface forms for entities and explicitly include them in the plan. Finally, we enhance reliability through plan diversification: multiple candidate plans are generated for each query, the SQL agent produces a query for each plan, and the final output is selected via majority voting over their execution results.
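The final selection step, majority voting over execution results, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the helper names (`execute`, `majority_vote`), the toy `employees` table, and the candidate queries are all hypothetical, and the sketch assumes that failed queries are simply excluded from the vote and that result sets are compared without regard to row order. The abstract does not specify either choice.

```python
import sqlite3
from collections import Counter


def execute(conn, sql):
    """Run one candidate query.

    Returns a hashable result (rows sorted so that comparison ignores
    row order, an assumption of this sketch), or None if the query fails.
    """
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    return tuple(sorted(rows, key=repr))


def majority_vote(conn, candidates):
    """Return the first candidate SQL whose execution result is the most common one."""
    executed = [(sql, execute(conn, sql)) for sql in candidates]
    valid = [(sql, res) for sql, res in executed if res is not None]
    if not valid:
        return None  # every candidate failed to execute
    winner, _ = Counter(res for _, res in valid).most_common(1)[0]
    return next(sql for sql, res in valid if res == winner)


# Toy database standing in for a benchmark schema (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("Ann", 120), ("Bo", 80), ("Cy", 150)],
)

# One SQL query per candidate plan; the last one is deliberately invalid.
candidates = [
    "SELECT name FROM employees WHERE salary > 100",
    "SELECT name FROM employees WHERE salary >= 101",
    "SELECT name FROM employees WHERE salary > 50",
    "SELECT nme FROM employees",
]

chosen = majority_vote(conn, candidates)
print(chosen)
```

In this toy run, the first two candidates return the same rows and outvote the third, while the invalid query is dropped before voting. Sorting rows before comparison would be inappropriate for questions whose answer depends on ordering; a real system would need to handle that case separately.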
