DeKeyNLU: Enhancing Natural Language to SQL Generation through Task Decomposition and Keyword Extraction

Jian Chen, Zhenyan Chen, Xuming Hu, Peilin Zhou, Yining Hua, Han Fang, Cissy Hing Yee Choy, Xinmei Ke, Jingfeng Luo, Zixuan Yuan

TL;DR

DeKeyNLU is a dataset of 1,500 meticulously annotated QA pairs focused on task decomposition and keyword extraction, built to strengthen the natural language understanding (NLU) stage of NL2SQL. Coupled with DeKeySQL, a RAG-based pipeline (User Question Understanding (UQU), Entity Retrieval, Generation, and Revision), the approach improves dev-set execution accuracy (EX) from 62.31% to 69.10% on BIRD and from 84.2% to 88.7% on Spider. The findings indicate that larger models are better at task decomposition while smaller models excel at keyword extraction, with entity retrieval emerging as a key driver of overall performance. This work shows that dataset-centric NLU refinements and modular RAG architectures can meaningfully improve database question answering for non-technical users and enable more practical NL2SQL systems.

Abstract

Natural Language to SQL (NL2SQL) provides a new model-centric paradigm that simplifies database access for non-technical users by converting natural language queries into SQL commands. Recent advancements, particularly those integrating Retrieval-Augmented Generation (RAG) and Chain-of-Thought (CoT) reasoning, have made significant strides in enhancing NL2SQL performance. However, challenges such as inaccurate task decomposition and keyword extraction by LLMs remain major bottlenecks, often leading to errors in SQL generation. While existing datasets aim to mitigate these issues by fine-tuning models, they struggle with over-fragmentation of tasks and a lack of domain-specific keyword annotations, limiting their effectiveness. To address these limitations, we present DeKeyNLU, a novel dataset that contains 1,500 meticulously annotated QA pairs aimed at refining task decomposition and enhancing keyword extraction precision for the RAG pipeline. We also propose DeKeySQL, a RAG-based NL2SQL pipeline fine-tuned with DeKeyNLU, which employs three distinct modules for user question understanding, entity retrieval, and generation to improve SQL generation accuracy. We benchmarked multiple model configurations within the DeKeySQL RAG pipeline. Experimental results demonstrate that fine-tuning with DeKeyNLU significantly improves SQL generation accuracy on both the BIRD (62.31% to 69.10%) and Spider (84.2% to 88.7%) dev datasets.
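
The BIRD and Spider figures are reported as execution accuracy (EX) on the dev sets. As a rough illustration of what such a score measures, the sketch below runs a predicted query and a reference query against the same database and counts a match when their result sets agree. The function names, the toy `city` table, and the set-based comparison are assumptions made for this example; the benchmarks' official evaluation scripts may differ in detail.

```python
import sqlite3


def execution_match(conn, predicted_sql, gold_sql):
    """Return True when both queries run and yield the same set of rows."""
    try:
        predicted_rows = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        # A prediction that fails to execute counts as a miss.
        return False
    gold_rows = conn.execute(gold_sql).fetchall()
    # Assumption: rows are compared as unordered sets.
    return set(predicted_rows) == set(gold_rows)


def execution_accuracy(conn, pairs):
    """Percentage of (predicted, gold) pairs whose results match."""
    matches = [execution_match(conn, pred, gold) for pred, gold in pairs]
    return 100.0 * sum(matches) / len(matches)


# Toy database, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE city (name TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO city VALUES (?, ?)", [("A", 3), ("B", 8), ("C", 10)]
)

pairs = [
    # Different SQL text, same result: counts as correct.
    ("SELECT name FROM city WHERE population > 5",
     "SELECT name FROM city WHERE population >= 6"),
    ("SELECT COUNT(*) FROM city", "SELECT COUNT(name) FROM city"),
    # Misspelled column: execution error, counts as wrong.
    ("SELECT nam FROM city", "SELECT name FROM city"),
    # Runs, but returns fewer rows than the reference: counts as wrong.
    ("SELECT name FROM city LIMIT 1", "SELECT name FROM city"),
]

score = execution_accuracy(conn, pairs)
print(f"EX = {score:.1f}%")
```

Because the comparison is on query results rather than SQL text, two differently written queries can both count as correct, which is why errors in keyword extraction or task decomposition matter only insofar as they change what the generated SQL returns.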

Paper Structure

This paper contains 25 sections, 1 equation, 7 figures, 16 tables.

Figures (7)

  • Figure 1: Comparison of advanced NL2SQL methods with DeKeySQL. GPT-4o suffers from incomplete task decomposition and incorrect keyword extraction. Because it lacks a revision module, GPT-4o shows lower code generation accuracy. Methods such as MAC-SQL, CHESS, and TA-SQL are efficient in either time or cost, but not both.
  • Figure 2: The DeKeyNLU dataset creation workflow. User questions are initially pre-annotated by GPT-4o for tasks (main and sub-tasks), objects, and implementations. These preliminary annotations are then subjected to a rigorous human verification process, where annotators correct and refine both task decomposition and keyword extraction. This involves three rounds of cross-validation. Following this, a final scoring phase identifies any low-scoring annotations, which are then collaboratively reviewed and further refined to produce the final, high-quality DeKeyNLU dataset.
  • Figure 3: Distribution of the number of main tasks, sub-tasks, and keywords per question in the DeKeyNLU dataset. These distributions illustrate the complexity inherent in the questions, reflecting the reasoning and integration capabilities required of NL2SQL models.
  • Figure 4: The DeKeySQL Framework. (1) The user's question is processed by the User Question Understanding (UQU) module using a prompt template, directing an LLM (fine-tuned on DeKeyNLU) to perform keyword extraction and task decomposition. (2) Extracted keywords are fed to the Entity Retrieval module to identify relevant column names, table values, and descriptions from the database. (3) Task decomposition outputs, retrieved entity data, and the original question are then input to the Generation LLM to produce SQL code. (4) If errors occur, the error information and generated SQL are passed to a revision LLM for correction. (5) Finally, the corrected SQL is executed to obtain the answer.
  • Figure 5: Prompt for keyword extraction and task decomposition.
  • ...and 2 more figures
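
The five steps in the Figure 4 caption can be sketched as a small control loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the three LLM calls are replaced by hypothetical stub functions (`fake_understand`, `fake_generate`, `fake_revise`), entity retrieval is reduced to substring matching against column names, and the `employee` table is invented for the demo.

```python
from __future__ import annotations

import sqlite3
from dataclasses import dataclass


@dataclass
class Understanding:
    """Output of the UQU step: extracted keywords and decomposed tasks."""
    keywords: list[str]
    tasks: list[str]


def retrieve_entities(conn, keywords):
    """Step 2 (simplified): find columns whose names match a keyword."""
    hits = []
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table'")]
    for table in tables:
        for col in conn.execute(f'PRAGMA table_info("{table}")'):
            column = col[1]  # table_info rows are (cid, name, type, ...)
            if any(k.lower() in column.lower() for k in keywords):
                hits.append(f"{table}.{column}")
    return hits


def answer(question, conn, understand, generate, revise, max_revisions=2):
    """Run understanding, retrieval, generation, revision, and execution."""
    understanding = understand(question)                        # step 1
    entities = retrieve_entities(conn, understanding.keywords)  # step 2
    sql = generate(question, understanding.tasks, entities)     # step 3
    for attempt in range(max_revisions + 1):
        try:
            return sql, conn.execute(sql).fetchall()            # step 5
        except sqlite3.Error as err:
            if attempt == max_revisions:
                raise
            sql = revise(sql, str(err))                         # step 4


# Hypothetical stand-ins for the fine-tuned and prompted LLMs.
def fake_understand(question):
    return Understanding(keywords=["name", "salary"],
                         tasks=["find the highest salary", "return the name"])


def fake_generate(question, tasks, entities):
    # Deliberately misspells a column so the revision step is exercised.
    return "SELECT nme FROM employee ORDER BY salary DESC LIMIT 1"


def fake_revise(sql, error_message):
    return sql.replace("nme", "name")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (name TEXT, salary INTEGER)")
conn.executemany("INSERT INTO employee VALUES (?, ?)",
                 [("Ann", 100), ("Bo", 200)])

final_sql, rows = answer("Who earns the most?", conn,
                         fake_understand, fake_generate, fake_revise)
print(final_sql, rows)
```

Passing the model calls in as plain functions keeps each stage swappable, which matches the caption's description of separate models for understanding, generation, and revision. The cap on revision attempts is an assumption added here so the loop always terminates.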