Rethinking Agentic Workflows: Evaluating Inference-Based Test-Time Scaling Strategies in Text2SQL Tasks

Jiajing Guo, Kenil Patel, Jorge Piazentin Ono, Wenbin He, Liu Ren

TL;DR

The paper tackles the practical deployment of Text-to-SQL systems by evaluating six lightweight, inference-based agentic workflows across four LLMs on the BIRD Mini-Dev benchmark, with a focus on balancing accuracy, latency, and token usage. It finds that Divide-and-Conquer prompting combined with few-shot demonstrations yields consistent improvements, even for reasoning-focused models, while more complex workflows offer mixed benefits and can increase latency. A strong base model can outperform highly engineered workflows, underscoring the importance of model selection. The work provides actionable guidance for practitioners seeking deployment-ready Text-to-SQL solutions and highlights trade-offs between accuracy and efficiency in real-world settings.

Abstract

Large language models (LLMs) are increasingly powering Text-to-SQL (Text2SQL) systems, enabling non-expert users to query industrial databases using natural language. While test-time scaling strategies have shown promise in LLM-based solutions, their effectiveness in real-world applications, especially with the latest reasoning models, remains uncertain. In this work, we benchmark six lightweight, industry-oriented test-time scaling strategies and four LLMs, including two reasoning models, evaluating their performance on the BIRD Mini-Dev benchmark. Beyond standard accuracy metrics, we also report inference latency and token consumption, providing insights relevant for practical system deployment. Our findings reveal that Divide-and-Conquer prompting and few-shot demonstrations consistently enhance performance for both general-purpose and reasoning-focused LLMs. However, introducing additional workflow steps yields mixed results, and base model selection plays a critical role. This work sheds light on the practical trade-offs between accuracy, efficiency, and complexity when deploying Text2SQL systems.

Paper Structure

This paper contains 10 sections, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Soft-F1 Score (left) and Execution Error Rate (right). The labels DC 3-shot x5 Parallel, DC 3-shot+Verification, and Retrieval+DC 3-shot denote the workflows that use parallel scaling, result verification, and retrieval-enhanced techniques, respectively. For visual clarity, the "+ReAct" suffix is omitted from these labels in the figure.
  • Figure 2: Average # of LLM Calls and Latency (left). Prompt and Completion Tokens (right).
  • Figure 3: Error analysis on DC 3-shot+ReAct and Retrieval+DC 3-shot+ReAct workflows.
  • Figure 4: Workflow diagrams of ReAct: SW > EX <> SR (left), Verification: SW > EX <> SR <> FP (middle), Retrieval-based: KE > (ER $\parallel$ CR) > SW > EX <> SR (right).
  • Figure 5: Execution Accuracy and R-VES Score
  • ...and 2 more figures