Meta-aware Learning in text-to-SQL Large Language Model

Wenda Zhang

TL;DR

This paper tackles text-to-SQL in complex business databases where large language models struggle with schema complexity and domain knowledge, including BigQuery SQL dialects. It introduces meta-aware learning that fuses four complementary strategies—schema-based learning, Chain-of-Thought (CoT) reasoning, domain knowledge enhancement, and key information tokenization—through fine-tuning to tailor SQL generation to domain contexts. Experiments on Walmart business data across two scenarios demonstrate higher execution accuracy, improved multi-task SQL capabilities, and reduced catastrophic forgetting compared with baselines. The work offers practical guidance on structured prompts, knowledge management, and tokenization to support robust, domain-adapted text-to-SQL systems, with potential for better long-context handling and cross-domain retrieval in future work.

Abstract

Advances in large language models (LLMs) have created new opportunities for text-to-SQL tasks, where the main challenges in business applications are understanding complex domain information and complex database structures. In this paper, we propose a meta-aware learning framework that integrates domain knowledge, database schema, chain-of-thought reasoning processes, and metadata relationships to improve SQL generation quality. The framework includes four learning strategies: schema-based learning, Chain-of-Thought (CoT) learning, knowledge-enhanced learning, and key information tokenization. Through fine-tuning, this approach gives the LLM a comprehensive understanding of database structure and metadata, improving its SQL generation within business domains. Through two experimental studies, we demonstrate the superiority of the proposed methods in execution accuracy, multi-task SQL generation capability, and reduction of catastrophic forgetting.

Paper Structure

This paper contains 16 sections, 3 figures, 3 tables, 1 algorithm.

Figures (3)

  • Figure 1: Comparison of SQL complexity between a public database and the business database. In the question from the business database, YTD means "year to date", and the definition of a year differs from the standard calendar year. Date information is maintained in a separate table. Key information is masked to show the SQL structures.
  • Figure 2: The framework of meta-aware learning with sub-modules: (a) schema-based learning module with prompt structure; (b) Chain-of-Thought (CoT) learning module with step-by-step reasoning process; (c) domain knowledge enhancement learning module with sub-tasks and relationships; (d) key information tokenization module with tokenized elements.
  • Figure 3: Performance comparison between tokenized-prompt training (prompt structure tokenization) and based-prompt-I training (entry d of the prompt structures table). The bar plot shows the number of training steps before overfitting is observed, across different training set sample sizes. The line plots compare the accuracy of the two strategies.
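
The modules named in Figure 2 (schema, domain knowledge, step-by-step reasoning, and tokenized prompt structure) can be pictured as sections of a single structured prompt. The sketch below is only an illustration under stated assumptions, not the authors' implementation: the marker tokens (`<schema>`, `<knowledge>`, `<reasoning>`, `<question>`), the class and function names, and the example tables `sales` and `fiscal_calendar` are all hypothetical and do not come from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Table:
    name: str
    columns: dict[str, str]  # column name -> SQL type


@dataclass
class PromptParts:
    question: str
    tables: list[Table]
    knowledge: list[str] = field(default_factory=list)
    reasoning_steps: list[str] = field(default_factory=list)


def render_schema(tables: list[Table]) -> str:
    """Render each table as `name(col type, ...)`, one table per line."""
    lines = []
    for table in tables:
        cols = ", ".join(f"{col} {typ}" for col, typ in table.columns.items())
        lines.append(f"{table.name}({cols})")
    return "\n".join(lines)


def build_prompt(parts: PromptParts) -> str:
    """Wrap each non-empty section in hypothetical marker tokens."""
    steps = "\n".join(
        f"{i}. {step}" for i, step in enumerate(parts.reasoning_steps, 1)
    )
    sections = [
        ("schema", render_schema(parts.tables)),
        ("knowledge", "\n".join(parts.knowledge)),
        ("reasoning", steps),
        ("question", parts.question),
    ]
    # Empty sections are skipped so the prompt stays compact.
    return "\n".join(
        f"<{name}>\n{body}\n</{name}>" for name, body in sections if body
    )


# Hypothetical example loosely modeled on the YTD case in Figure 1.
parts = PromptParts(
    question="What are YTD sales per store?",
    tables=[
        Table("sales", {"store_id": "INT64", "sale_date": "DATE", "amount": "NUMERIC"}),
        Table("fiscal_calendar", {"calendar_date": "DATE", "fiscal_year": "INT64"}),
    ],
    knowledge=["YTD means year to date, measured on the fiscal year in fiscal_calendar."],
    reasoning_steps=[
        "Join sales to fiscal_calendar on the date.",
        "Filter to the current fiscal year up to today.",
        "Sum amount per store.",
    ],
)
prompt = build_prompt(parts)
print(prompt)
```

Keeping each kind of information in its own delimited section is one plausible way to make the prompt structure explicit to a fine-tuned model; whether the paper's tokenization uses markers of this form is an assumption here.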