Wikidata as a seed for Web Extraction

Kunpeng Guo, Dennis Diefenbach, Antoine Gourru, Christophe Gravier

TL;DR

This work tackles the challenge of enriching Wikidata with facts hidden across the Web by introducing WebExtractor, a QA-based framework that uses Wikidata as a seed to extract facts from HTML pages. It casts the extraction task as extractive question answering, trains language models with signals derived from Wikidata, and integrates an object-linking step plus an editor-facing Wikidata gadget (WikidataComplete) so that extracted facts are validated by editors before they are added. Across 54 domains and multiple properties, the approach achieves an average F1 of 84.07 in supervised settings and demonstrates strong zero-shot and few-shot transfer when using a pre-trained WebExtractor model, suggesting the potential to generate millions of candidate facts for editor validation. This enables scalable, domain-agnostic extraction from heterogeneous Web sources, with practical impact on accelerating Wikidata completion and knowledge graph maintenance. Future work includes expanding domain coverage and multilingual capabilities, further improving disambiguation in object linking, and tightening end-to-end editorial workflows.

Abstract

Wikidata has grown into a knowledge graph of impressive size. To date, it contains more than 17 billion triples covering people, places, films, stars, publications, proteins, and much more. On the other hand, most of the information on the Web is not published in highly structured data repositories like Wikidata, but rather as unstructured and semi-structured content, more concretely in HTML pages containing text and tables. Finding, monitoring, and organizing this data in a knowledge graph requires considerable work from human editors. The volume and complexity of the data make this task difficult and time-consuming. In this work, we present a framework that identifies and extracts new facts published under multiple Web domains so that they can be proposed for validation by Wikidata editors. The framework relies on question-answering technologies. We take inspiration from ideas used to extract facts from textual collections and adapt them to Web pages. To achieve this, we demonstrate that language models can be adapted to extract facts not only from textual collections but also from Web pages. By exploiting the information already contained in Wikidata, the proposed framework can be trained without any additional learning signals and can extract new facts for a wide range of properties and domains. In this way, Wikidata can be used as a seed to extract facts from the Web. Our experiments show that we achieve a mean F1-score of 84.07. Moreover, our estimates show that we can potentially extract millions of facts that can be proposed for human validation. The goal is to help editors in their daily tasks and to contribute to the completion of the Wikidata knowledge graph.
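
The idea of training from Wikidata alone can be illustrated with a minimal sketch: a triple already in Wikidata, paired with a Web page about the same subject, yields a question, a context, and an answer span in the style of extractive question answering. The question template, the `QAExample` type, the `build_qa_example` helper, and the sample text are all assumptions made for illustration; the abstract does not specify how the authors build their training data.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class QAExample:
    question: str
    context: str
    answer_text: str
    answer_start: int  # character offset of the answer inside context


def build_qa_example(subject_label: str, property_label: str,
                     object_label: str, page_text: str) -> Optional[QAExample]:
    """Turn one known triple plus one page into an extractive-QA example.

    Returns None when the object label does not occur in the page text,
    because there is then no answer span to supervise on.
    """
    # Hypothetical question template; the paper's actual wording may differ.
    question = f"What is the {property_label} of {subject_label}?"
    # Case-insensitive search for the first occurrence of the object label;
    # offsets refer to the original, unmodified page text.
    match = re.search(re.escape(object_label), page_text, flags=re.IGNORECASE)
    if match is None:
        return None
    return QAExample(question, page_text, match.group(0), match.start())


# Made-up page text and triple, for illustration only.
page = "Jane Doe. Employer: Example University."
example = build_qa_example("Jane Doe", "employer", "example university", page)
missing = build_qa_example("Jane Doe", "employer", "Other Institute", page)
```

Pages on which the known object cannot be located yield no example (`missing` is `None` here), which is one simple way to keep noisy supervision out of the training set. A real system would also need to handle aliases and values that appear in a different surface form.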

Paper Structure

This paper contains 15 sections, 1 equation, 6 figures, and 4 tables.

Figures (6)

  • Figure 1: Diagram of the main pipeline of our framework WebExtractor. The framework consists of the following modules (clockwise): knowledge selection (which identifies facts to be completed), data cleaning (which fetches websites that may contain the underlying fact and performs general cleaning), relation extraction (which extracts the actual fact from a website), object linking (which links the identified object to a Wikidata item), and WikidataComplete integration (which proposes extracted facts to users for fact verification).
  • Figure 2: Web extraction from a well-structured field in the website Clinicaltrials.gov. The "study type" for the clinical trial "Klinik - Intelligent Patient Flow Management" is extracted.
  • Figure 3: Web extraction from a semi-structured field in the website ORCID. We extract the "employer" of the researcher "Evzen Amler".
  • Figure 4: Web extraction from an unstructured field in the website MusicBrainz. We extract the "occupation" of "Victor Noriega".
  • Figure 5: WikidataComplete: a Wikidata gadget intended to help users add more facts to the Wikidata knowledge base. In the statement section, a user can see statements to approve or reject. A reference shows on which website the fact was found and what evidence supports it.
  • ...and 1 more figure