FRASE: Structured Representations for Generalizable SPARQL Query Generation
Papa Abdou Karim Karou Diallo, Amal Zouaq
TL;DR
FRASE introduces a frame-semantic augmentation pipeline for NL-to-SPARQL generation to overcome generalization gaps caused by template-based training. It combines a retrieval-augmented frame detection stage with grounded argument identification, and feeds the resulting frame-enriched questions to an instruction-tuned LLM for SPARQL generation. It also introduces LC-QuAD 3.0 (LCQ3), a frame-enriched variant of LC-QuAD 2.0 (LCQ2). Evaluations across Original and Unknown Template splits, plus reformulated (template-free) questions, show consistent improvements in execution-based metrics when frame information is incorporated, with the largest gains when models are trained and tested on frame-augmented, combined question sets. The work demonstrates that structured semantic representations grounded in FrameNet frames can substantially enhance robustness and abstraction in KB querying, with implications for multilingual and broader NLP applications.
Abstract
Translating natural language questions into SPARQL queries enables Knowledge Base querying for factual and up-to-date responses. However, existing datasets for this task are predominantly template-based, leading models to learn superficial mappings between question and query templates rather than developing true generalization capabilities. As a result, models struggle when encountering naturally phrased, template-free questions. This paper introduces FRASE (FRAme-based Semantic Enhancement), a novel approach that leverages Frame Semantic Role Labeling (FSRL) to address this limitation. We also present LC-QuAD 3.0, a new dataset derived from LC-QuAD 2.0, in which each question is enriched using FRASE through frame detection and the mapping of frame elements to their arguments. We evaluate the impact of this approach through extensive experiments on recent large language models (LLMs) under different fine-tuning configurations. Our results demonstrate that integrating frame-based structured representations consistently improves SPARQL generation performance, particularly in challenging generalization scenarios where test questions feature unseen templates (unknown template splits) and where they are all naturally phrased (reformulated questions).
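To make the idea of a frame-enriched question concrete, the following is a minimal sketch of how a detected frame and its frame-element-to-argument mapping might be serialized alongside the question before being passed to an LLM. The class `FrameAnnotation`, the function `build_prompt`, the prompt layout, and the example frame and role labels are all illustrative assumptions; they are not taken from the paper and do not reflect the actual LC-QuAD 3.0 format.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class FrameAnnotation:
    """Hypothetical container for one detected frame and its arguments."""
    frame: str                                               # frame label
    elements: Dict[str, str] = field(default_factory=dict)   # role -> argument text


def build_prompt(question: str, annotation: FrameAnnotation) -> str:
    """Serialize a question plus its frame annotation into a single prompt.

    The layout below is an assumption made for illustration only.
    """
    lines = [f"Question: {question}", f"Frame: {annotation.frame}"]
    # One indented line per frame element, in insertion order.
    for role, argument in annotation.elements.items():
        lines.append(f"  {role}: {argument}")
    lines.append("SPARQL:")
    return "\n".join(lines)


if __name__ == "__main__":
    # Illustrative labels; not verified against the paper's annotations.
    annotation = FrameAnnotation(
        frame="Intentionally_create",
        elements={"Creator": "Who", "Created_entity": "Microsoft"},
    )
    print(build_prompt("Who founded Microsoft?", annotation))
```

The same surface question rephrased in a template-free way would, under this sketch, keep the same frame and role structure, which is the intuition behind using frames to support generalization.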
