Q-NL Verifier: Leveraging Synthetic Data for Robust Knowledge Graph Question Answering
Tim Schwabe, Louisa Siebel, Patrik Valach, Maribel Acosta
TL;DR
The paper tackles the data bottleneck in knowledge graph question answering (KGQA) with Q-NL Verifier, a framework that uses LLMs to generate synthetic SPARQL→NL translations and a learned semantic verifier to filter them for correctness. The verifier, implemented in both bi-encoder and cross-encoder variants, assigns each translation a normalized similarity score $s\in[0,1]$, enabling automatic quality control without gold references. On LC-QuAD 2.0, the verifier generalizes to translations produced by other LLMs and by humans, and it ranks translations by semantic correctness better than traditional NLP metrics: its ranking matches the one obtained from manual accuracy (rank correlation $\rho(ACC_{manual},\cdot)=1.0$). Using the verifier to train on or filter data also improves NL→Q translation, and the authors release LC-QuAD 2.0-synth, which includes verifier scores, to support robust QA research. Overall, Q-NL Verifier enables scalable, semantically accurate QA data generation and provides practical benefits for dataset cleaning, model training, and live QA systems.
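To make the normalized score $s\in[0,1]$ concrete, here is a minimal Python sketch of how a bi-encoder style verifier could turn two embeddings into such a score. It is an illustration under stated assumptions, not the authors' implementation: the function names `cosine` and `normalized_similarity` are hypothetical, the embeddings are assumed to come from some encoder that is not shown, and the mapping $s=(\cos+1)/2$ is one common way to normalize a cosine similarity that the summary does not confirm.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors, in [-1, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # A zero vector has no direction; treat it as uncorrelated.
        return 0.0
    return dot / (norm_u * norm_v)


def normalized_similarity(query_vec, nl_vec):
    """Map cosine similarity from [-1, 1] to a score s in [0, 1].

    Assumption: the rescaling (cos + 1) / 2 is illustrative only; the
    source states that s lies in [0, 1] but not how it is normalized.
    """
    return (cosine(query_vec, nl_vec) + 1.0) / 2.0
```

In this sketch, a query embedding and a translation embedding that point in the same direction score close to 1, orthogonal ones score 0.5, and opposite ones score 0. A cross-encoder variant would instead feed the query and the translation jointly into one model and read the score from its output.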
Abstract
Question answering (QA) requires accurately aligning user questions with structured queries, a process often limited by the scarcity of high-quality query-natural language (Q-NL) pairs. To overcome this, we present Q-NL Verifier, an approach to generating high-quality synthetic pairs of queries and NL translations. Our approach relies on large language models (LLMs) to generate semantically precise natural language paraphrases of structured queries. Building on these synthetic Q-NL pairs, we introduce a learned verifier component that automatically determines whether a generated paraphrase is semantically equivalent to the original query. Our experiments with the well-known LC-QuAD 2.0 benchmark show that Q-NL Verifier generalizes well to paraphrases from other models and even human-authored translations. Our approach strongly aligns with human judgments across varying query complexities and outperforms existing NLP metrics in assessing semantic correctness. We also integrate the verifier into QA pipelines, showing that verifier-filtered synthetic data has significantly higher quality in terms of translation correctness and enhances NL to Q translation accuracy. Lastly, we release an updated version of the LC-QuAD 2.0 benchmark containing our synthetic Q-NL pairs and verifier scores, offering a new resource for robust and scalable QA.
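The verifier-based filtering step described in the abstract can be sketched in a few lines of Python. This is a hedged illustration, not the released pipeline: the `QNLPair` record, the `filter_pairs` function, and the default threshold of 0.5 are all hypothetical, and the verifier scores are assumed to have been computed beforehand.

```python
from dataclasses import dataclass


@dataclass
class QNLPair:
    query: str      # structured query, e.g. SPARQL
    question: str   # natural language translation of the query
    score: float    # verifier score in [0, 1], assumed precomputed


def filter_pairs(pairs, threshold=0.5):
    """Keep only pairs whose verifier score reaches the threshold.

    Assumption: 0.5 is a placeholder; a suitable cut-off would have to
    be chosen on held-out data for the verifier actually in use.
    """
    return [p for p in pairs if p.score >= threshold]
```

A typical use would be to score every synthetic pair once, store the score next to the pair, and then apply `filter_pairs` with different thresholds to trade dataset size against translation correctness before training an NL to Q model.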
