TARGA: Targeted Synthetic Data Generation for Practical Reasoning over Structured Data

Xiang Huang, Jiayu Shen, Shanshan Huang, Sitao Cheng, Xiaxia Wang, Yuzhong Qu

TL;DR

TARGA tackles two core challenges in semantic parsing for KBQA: dependence on manually annotated data and poor generalization to unseen questions. It creates targeted synthetic demonstrations by expanding from relevant KB items into valid query graphs, then re-ranks and textifies these queries to produce NLQ-Query pairs for in-context learning with a 7B open LLM. Empirical results show substantial gains over non-fine-tuned baselines on GrailQA and KBQA-Agent, along with strong sample efficiency, robustness, and cross-task transferability to Text2SQL. The approach demonstrates practical, annotation-free, data-efficient reasoning over structured data and highlights the potential of online synthetic data generation for real-world KBQA systems.

Abstract

Semantic parsing, which converts natural language questions into logic forms, plays a crucial role in reasoning within structured environments. However, existing methods face two significant challenges: reliance on extensive manually annotated datasets and limited generalization to unseen examples. To tackle these issues, we propose Targeted Synthetic Data Generation (TARGA), a practical framework that dynamically generates high-relevance synthetic data without manual annotation. Starting from the pertinent entities and relations of a given question, we probe for potentially relevant queries through layer-wise expansion and cross-layer combination. We then generate corresponding natural language questions for these constructed queries, and the resulting pairs jointly serve as synthetic demonstrations for in-context learning. Experiments on multiple knowledge base question answering (KBQA) datasets demonstrate that TARGA, using only a 7B-parameter model, substantially outperforms existing non-fine-tuned methods that use closed-source models, achieving notable F1 improvements on GrailQA (+7.7) and KBQA-Agent (+12.2). Furthermore, TARGA exhibits superior sample efficiency, robustness, and generalization capabilities under non-I.I.D. settings.
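
To make the two probing steps named in the abstract more concrete, here is a minimal sketch of layer-wise expansion and cross-layer combination over a toy triple store. It is not the authors' implementation: the toy KB, the function names (`neighbors`, `execute`, `expand`, `combine`), and the `~relation` notation for inverse edges are all assumptions made for illustration. The idea shown is only that a path is extended to the next layer if it still returns results, and that two paths from different anchor entities are combined when their result sets intersect.

```python
# Toy knowledge base of (subject, relation, object) triples. Hypothetical data.
KB = {
    ("Inception", "directed_by", "Christopher Nolan"),
    ("Inception", "released_in", "2010"),
    ("Interstellar", "directed_by", "Christopher Nolan"),
    ("Interstellar", "released_in", "2014"),
    ("Christopher Nolan", "born_in", "London"),
}

def neighbors(nodes, rel):
    """Follow `rel` forward, or backward when written as '~rel' (our convention)."""
    if rel.startswith("~"):
        return {s for (s, r, o) in KB if r == rel[1:] and o in nodes}
    return {o for (s, r, o) in KB if r == rel and s in nodes}

def execute(start, path):
    """Run a relation path from a start entity and return the reached nodes."""
    nodes = {start}
    for rel in path:
        nodes = neighbors(nodes, rel)
    return nodes

def expand(start, relations, max_hops=2):
    """Layer-wise expansion: only paths with non-empty results are extended."""
    found, frontier = [], [()]
    for _ in range(max_hops):
        frontier = [p + (r,) for p in frontier for r in relations
                    if execute(start, p + (r,))]
        found.extend(frontier)
    return found

def combine(start_a, paths_a, start_b, paths_b):
    """Cross-layer combination: keep path pairs whose answer sets intersect."""
    combos = []
    for pa in paths_a:
        for pb in paths_b:
            shared = execute(start_a, pa) & execute(start_b, pb)
            if shared:
                combos.append((pa, pb, shared))
    return combos

# Candidate relations assumed to have been retrieved for a question such as
# "Which film directed by Christopher Nolan was released in 2010?"
RELS = ["~directed_by", "~released_in", "released_in"]
nolan_paths = expand("Christopher Nolan", RELS)
year_paths = expand("2010", RELS)
combos = combine("Christopher Nolan", nolan_paths, "2010", year_paths)
```

In this toy setting, each surviving path or combination would be a candidate query. Turning such a query into a natural language question (the "textify" step) and ranking the resulting pairs are handled by models in TARGA and are not sketched here.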

Paper Structure

This paper contains 48 sections, 4 equations, 5 figures, and 14 tables.

Figures (5)

  • Figure 1: Compared with previous methods, TARGA aims to mitigate the reliance on large amounts of manually labeled data and enhance generalization capabilities in non-I.I.D. scenarios.
  • Figure 2: Overview of TARGA.
  • Figure 3: Performance with various numbers of demonstrations on GrailQA (1,000 randomly sampled questions).
  • Figure 4: Performance under attack setting on 1,000 randomly sampled GrailQA questions. Attack level indicates how many demonstrations have been corrupted.
  • Figure 5: Performance under attack setting (entity) on 1,000 randomly sampled GrailQA questions.