ORANGE: An Online Reflection ANd GEneration framework with Domain Knowledge for Text-to-SQL

Yiwen Jiao, Tonghui Ren, Yuche Gao, Zhenying He, Yinan Jing, Kai Zhang, X. Sean Wang

TL;DR

Researchers address the semantic gap between general LLM knowledge and domain-specific database semantics in Text-to-SQL by introducing ORANGE, an online self-evolutionary framework that builds a database-specific knowledge base from translation logs. ORANGE uses a three-stage pipeline (Knowledge Decomposition, Knowledge Validation, and Knowledge-Enhanced Translation), augmented by a nested Chain-of-Thought strategy and tuple-semantic tracking, to generate faithful, domain-aware knowledge units (k_x, k_y) stored in a memory M. Using in-domain demonstrations selected by semantic similarity and a probability-based validation filter, ORANGE improves execution accuracy (EX) on the Bird, Spider, and Science benchmarks and remains robust across prior SQL generators and foundation models. By reusing past translations, the framework supports continual, autonomous improvement in real-world Text-to-SQL deployment, reduces semantic errors, and scales with model capabilities and databases.
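The memory, validation filter, and similarity-based demonstration selection described above can be illustrated with a minimal Python sketch. This is not the authors' implementation. The roles assigned to k_x (a natural-language description) and k_y (a SQL fragment), the `prob` field, the 0.5 threshold, the bag-of-words similarity (a toy stand-in for a real sentence encoder), and the example table names are all assumptions made for illustration.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class KnowledgeUnit:
    k_x: str     # assumed: natural-language description of a SQL fragment
    k_y: str     # assumed: the SQL fragment itself
    prob: float  # assumed: confidence score attached during validation


def bow(text: str) -> Counter:
    """Bag-of-words vector; a toy stand-in for a real sentence encoder."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def validate(candidates, threshold=0.5):
    """Probability-based filter: keep units at or above the (hypothetical) threshold."""
    return [u for u in candidates if u.prob >= threshold]


def retrieve(memory, question, k=2):
    """Return the k units whose k_x is most similar to the question."""
    q = bow(question)
    return sorted(memory, key=lambda u: cosine(q, bow(u.k_x)), reverse=True)[:k]


# Hypothetical candidate units, as if decomposed from past translation logs.
candidates = [
    KnowledgeUnit("schools with charter funding",
                  "SELECT * FROM schools WHERE charter = 1", 0.9),
    KnowledgeUnit("average math score per district",
                  "SELECT district, AVG(math) FROM scores GROUP BY district", 0.8),
    KnowledgeUnit("students older than schools",
                  "SELECT * FROM students", 0.2),
]

memory = validate(candidates)  # memory M after the validation filter
demos = retrieve(memory, "list charter schools", k=1)
print(demos[0].k_y)
```

In this sketch the low-confidence third unit never enters the memory, and the retrieved unit would be placed in the prompt as an in-domain demonstration for the next translation.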

Abstract

Large Language Models (LLMs) have demonstrated remarkable progress in translating natural language to SQL, but a significant semantic gap persists between their general knowledge and domain-specific semantics of databases. Historical translation logs constitute a rich source of this missing in-domain knowledge, where SQL queries inherently encapsulate real-world usage patterns of database schema. Existing methods primarily enhance the reasoning process for individual translations but fail to accumulate in-domain knowledge from past translations. We introduce ORANGE, an online self-evolutionary framework that constructs database-specific knowledge bases by parsing SQL queries from translation logs. By accumulating in-domain knowledge that contains schema and data semantics, ORANGE progressively reduces the semantic gap and enhances the accuracy of subsequent SQL translations. To ensure reliability, we propose a novel nested Chain-of-Thought SQL-to-Text strategy with tuple-semantic tracking, which reduces semantic errors during knowledge generation. Experiments on multiple benchmarks confirm the practicality of ORANGE, demonstrating its effectiveness for real-world Text-to-SQL deployment, particularly in handling complex and domain-specific queries.
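Accuracy is reported as execution accuracy (EX), which is commonly understood as the fraction of predicted queries whose execution result matches that of the reference query. A rough sketch of that idea using Python's built-in `sqlite3` module is shown below. The table, the queries, and the helper names `execution_match` and `execution_accuracy` are hypothetical, and each benchmark's official evaluator may treat row order and duplicates differently.

```python
import sqlite3


def execution_match(conn, predicted_sql, gold_sql):
    """True when both queries return the same rows, ignoring row order."""
    try:
        pred = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False  # a predicted query that fails to run counts as wrong
    gold = conn.execute(gold_sql).fetchall()
    return sorted(pred, key=repr) == sorted(gold, key=repr)


def execution_accuracy(conn, pairs):
    """Fraction of (predicted, gold) pairs whose execution results match."""
    return sum(execution_match(conn, p, g) for p, g in pairs) / len(pairs)


# Hypothetical toy database for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE schools (name TEXT, charter INTEGER);
    INSERT INTO schools VALUES ('A', 1), ('B', 0), ('C', 1);
""")

pairs = [
    # Same rows, different text: counts as correct.
    ("SELECT name FROM schools WHERE charter = 1",
     "SELECT name FROM schools WHERE charter = 1 ORDER BY name"),
    # Missing filter: different rows, counts as wrong.
    ("SELECT name FROM schools",
     "SELECT name FROM schools WHERE charter = 1"),
    # Misspelled column: execution error, counts as wrong.
    ("SELECT nme FROM schools",
     "SELECT name FROM schools"),
]

ex = execution_accuracy(conn, pairs)
print(f"EX = {ex:.2%}")
```

Comparing results rather than query text means two differently written queries count as equivalent when they return the same rows on the given database.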

Paper Structure

This paper contains 24 sections, 7 equations, 4 figures, and 5 tables.

Figures (4)

  • Figure 1: A Text-to-SQL example.
  • Figure 2: Overview of ORANGE.
  • Figure 3: EX score (%) on Bird dev under various hyper-parameters of ORANGE.
  • Figure 4: Case Study.