The Wikidata Query Logs Dataset

Sebastian Walter, Hannah Bast

TL;DR

The paper presents WDQL, a dataset of 200k question-query pairs for Wikidata, derived from real Wikidata Query Service (WDQS) logs and de-anonymized via an agent-based SPARQL-to-Question (S2Q) pipeline. It details the GRASP-based S2Q agent, the workflow for creating the dataset from anonymized logs, and statistics showing the breadth and diversity of SPARQL constructs in WDQL. A case study with KGQA models demonstrates significant performance gains when training on WDQL, underscoring its value for training robust question-answering systems over knowledge graphs. The work also provides open-source assets and a general methodology for converting anonymized query logs into pairs of natural-language questions and queries, with potential applicability to other knowledge graphs.

Abstract

We present the Wikidata Query Logs (WDQL) dataset, a dataset consisting of 200k question-query pairs over the Wikidata knowledge graph. It is over 6x larger than the largest existing Wikidata datasets of similar format without relying on template-generated queries. Instead, we construct it using real-world SPARQL queries sent to the Wikidata Query Service and generate questions for them. Since these log-based queries are anonymized, and therefore often do not produce results, a significant amount of effort is needed to convert them back into meaningful SPARQL queries. To achieve this, we present an agent-based method that iteratively de-anonymizes, cleans, and verifies queries against Wikidata while also generating corresponding natural-language questions. We demonstrate the dataset's benefit for training question-answering methods. All WDQL assets, as well as the agent code, are publicly available under a permissive license.

Paper Structure (8 sections, 5 tables)