Skeletons Matter: Dynamic Data Augmentation for Text-to-Query

Yuchen Ji, Bo Xu, Jie Shi, Jiaqing Liang, Deqing Yang, Yu Mao, Hai Chen, Yanghua Xiao

TL;DR

The paper formalizes Text-to-Query as a unified paradigm that translates natural language questions into diverse query languages via query skeletons. It introduces a dynamic data augmentation framework, Skeletron, built on three components: dynamic diagnosis of skeleton weaknesses, a skeleton generalizer to create novel skeletons, and a skeleton-guided backward-forward data synthesis pipeline to generate high-quality training data verified with chain-of-thought reasoning. Empirical results across Text-to-SQL, Text-to-Cypher, and Text-to-nGQL benchmarks show state-of-the-art performance with only a small amount of synthetic data, highlighting efficiency and generality. This work lays a foundation for unified, skeleton-aware optimization in Text-to-Query tasks and provides a practical, deployable approach for cross-language semantic parsing.

Abstract

The task of translating natural language questions into query languages has long been a central focus in semantic parsing. Recent advancements in Large Language Models (LLMs) have significantly accelerated progress in this field. However, existing studies typically focus on a single query language, resulting in methods with limited generalizability across different languages. In this paper, we formally define the Text-to-Query task paradigm, unifying semantic parsing tasks across various query languages. We identify query skeletons as a shared optimization target of Text-to-Query tasks, and propose a general dynamic data augmentation framework that explicitly diagnoses model-specific weaknesses in handling these skeletons to synthesize targeted training data. Experiments on four Text-to-Query benchmarks demonstrate that our method achieves state-of-the-art performance using only a small amount of synthesized data, highlighting the efficiency and generality of our approach and laying a solid foundation for unified research on Text-to-Query tasks. We release our code at https://github.com/jjjycaptain/Skeletron.
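To make the notion of a query skeleton more concrete, the following is a minimal Python sketch. It assumes a skeleton is obtained by keeping query keywords and operators while masking schema items and literal values with a placeholder. The keyword list, the `_` placeholder, and the function name `extract_skeleton` are illustrative assumptions, not the authors' implementation, and the sketch only covers a small subset of SQL.

```python
import re

# Hypothetical, deliberately small keyword list (an assumption for this
# sketch); a real system would rely on a proper parser for each language.
SQL_KEYWORDS = {
    "SELECT", "FROM", "WHERE", "GROUP", "BY", "ORDER", "HAVING", "LIMIT",
    "JOIN", "ON", "AS", "AND", "OR", "NOT", "IN", "COUNT", "SUM", "AVG",
    "MIN", "MAX", "DISTINCT", "DESC", "ASC",
}

# Tokens: quoted strings, numbers, identifiers, comparison operators, punctuation.
TOKEN_RE = re.compile(
    r"'[^']*'|\d+(?:\.\d+)?|[A-Za-z_][A-Za-z_0-9.]*|[<>!=]+|[(),*]"
)


def extract_skeleton(query: str) -> str:
    """Return a structure-only view of the query.

    Keywords are upper-cased and kept; table names, column names, and
    literal values are replaced by the placeholder '_'. Adjacent
    placeholders (for example a table followed by its alias) are merged.
    """
    out = []
    for tok in TOKEN_RE.findall(query):
        if tok.upper() in SQL_KEYWORDS:
            out.append(tok.upper())
        elif tok[0].isalnum() or tok[0] in "'_":
            # Schema item or literal value: mask it.
            if not out or out[-1] != "_":
                out.append("_")
        else:
            # Operator or punctuation: part of the structure, keep as-is.
            out.append(tok)
    return " ".join(out)


a = extract_skeleton("SELECT name FROM users WHERE age > 30")
b = extract_skeleton("select title from books where year > 2000")
print(a)
print(a == b)
```

Under these assumptions, two questions over different schemas map to the same skeleton whenever their queries share the same structure, which is what lets a skeleton act as a target shared across databases and, with a different keyword set, across query languages.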

Paper Structure

This paper contains 40 sections, 1 equation, 19 figures, 5 tables.

Figures (19)

  • Figure 1: Examples of query skeletons from three different query languages.
  • Figure 2: Our proposed method consists of three key components: (i) Dynamic Diagnosis on Query Skeletons: We analyze model behavior to identify query skeletons it struggles with, constructing an error-prone skeleton set to guide targeted data synthesis. (ii) Skeleton Generalizer: A skeleton generation model is trained on the error-prone set to produce structurally novel skeletons, expanding the diversity of the skeleton pool. (iii) Skeleton-Guided Backward-Forward Data Synthesis: We instantiate skeletons from the pool under diverse schema contexts and synthesize high-quality, targeted training data through a backward-forward generation framework.
  • Figure 3: Comparison of overall error rate (1 - EX) and query skeleton error rate across different LLMs and Skeletron 14B on the BIRD Dev set. The method for identifying query skeleton errors follows Section \ref{sec:error-detect}.
  • Figure 4: EX on the BIRD and Spider Dev sets under different structural edit distance thresholds used in the dynamic diagnosis step.
  • Figure 5: An example of SQLite database schema.
  • ...and 14 more figures
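
The captions of Figures 3 and 4 refer to query skeleton errors and to a structural edit distance threshold used in the dynamic diagnosis step. The exact definitions are not given in this summary, so the sketch below is only an assumption: it treats structural edit distance as token-level Levenshtein distance between two skeleton strings and flags a prediction as a skeleton error when that distance exceeds a threshold. The names `edit_distance` and `is_skeleton_error` and the default threshold are hypothetical.

```python
def edit_distance(a: list[str], b: list[str]) -> int:
    """Token-level Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,             # delete x
                cur[j - 1] + 1,          # insert y
                prev[j - 1] + (x != y),  # substitute (free if tokens match)
            ))
        prev = cur
    return prev[-1]


def is_skeleton_error(pred_skeleton: str, gold_skeleton: str,
                      threshold: int = 0) -> bool:
    """Assumed rule: a prediction counts as a skeleton error when its
    skeleton is more than `threshold` token edits away from the gold one."""
    distance = edit_distance(pred_skeleton.split(), gold_skeleton.split())
    return distance > threshold


gold = "SELECT _ FROM _ WHERE _ > _"
pred = "SELECT _ FROM _"
print(edit_distance(pred.split(), gold.split()))
print(is_skeleton_error(pred, gold))
```

Under this assumed rule, a larger threshold tolerates more structural deviation before a skeleton is added to the error-prone set, which is the kind of trade-off the threshold comparison in Figure 4 appears to examine.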