Data Distribution Bottlenecks in Grounding Language Models to Knowledge Bases
Yiheng Shu, Zhiwei Yu
TL;DR
This paper interrogates the robustness of grounding language models to knowledge bases for KBQA under real-world data distribution shifts, arguing that existing benchmarks and protocols inadequately capture practical challenges. It analyzes dataset construction factors and evaluation fairness, then introduces the GAIN data-augmentation framework to expand schema coverage, paraphrase resilience, and cross-dataset transfer. Across GrailQA, GraphQuestions, and SimpleQuestions-Balance, GAIN yields substantial gains in F1 and relation-linking metrics, especially under schema-level generalization and zero-shot settings, while also reducing paraphrase score variability. The findings indicate that while data augmentation and larger models help, robust, real-world KBQA still requires more representative data collection and more realistic evaluation protocols to bridge the gap between benchmarks and deployment realities.
Abstract
Language models (LMs) have already demonstrated remarkable abilities in understanding and generating both natural and formal language. Despite these advances, their integration with real-world environments such as large-scale knowledge bases (KBs) remains underdeveloped, which hampers applications such as semantic parsing and leaves LMs prone to producing "hallucinated" information. This paper is an experimental investigation aimed at uncovering the robustness challenges that LMs encounter when tasked with knowledge base question answering (KBQA). The investigation covers scenarios in which the data distribution differs between training and inference, such as generalization to unseen domains, adaptation to diverse language variations, and transferability across different datasets. Our comprehensive experiments reveal that, even when trained with our proposed data augmentation techniques, advanced small and large language models perform poorly along several of these dimensions. While the LM is a promising technology, its robustness in complex environments is currently fragile and of limited practicality because of the data distribution issue. This calls for future research on data collection and LM learning paradigms.
