Diagnosing and Mitigating Semantic Inconsistencies in Wikidata's Classification Hierarchy

Shixiong Zhao, Hideaki Takeda

TL;DR

This work tackles the pervasive semantic and structural inconsistencies in Wikidata's classification hierarchy, driven by the mixed use of instance-of (P31) and subclass-of (P279). It introduces a three-stage framework that combines graph-based structural analysis with text-based semantic embeddings to diagnose, quantify, and prioritize risks in taxonomy links. A CME-detection pipeline identifies misuses, a multi-dimensional risk model quantifies entity-level inconsistencies, and a scalable semantic drift detector uses Sentence-BERT embeddings to enable full-graph analysis. The authors provide a real-time, user-facing interface for inspecting risk signals, offering practical guidance for editors and downstream KG applications. Their findings suggest that a risk-aware approach, one that preserves some useful redundant edges while targeting high-risk links, can improve Wikidata's reliability without sacrificing its community-driven richness.

Abstract

Wikidata is currently the largest open knowledge graph on the web, encompassing over 120 million entities. It integrates data from various domain-specific databases and imports a substantial amount of content from Wikipedia, while also allowing users to freely edit its content. This openness has positioned Wikidata as a central resource in knowledge graph research and has enabled convenient knowledge access for users worldwide. However, its relatively loose editorial policy has also led to a degree of taxonomic inconsistency. Building on prior work, this study proposes and applies a novel validation method to confirm the presence of classification errors, over-generalized subclass links, and redundant connections in specific domains of Wikidata. We further introduce a new evaluation criterion for determining whether such issues warrant correction, and we develop a system that allows users to inspect the taxonomic relationships of arbitrary Wikidata entities, leveraging the platform's crowdsourced nature to its full potential.
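
As a rough illustration of what a "redundant connection" can look like structurally, the sketch below flags a subclass-of (P279) edge when its parent is still reachable from the child through some other path. This is only one plausible reading of redundancy, not the validation method the paper proposes. The function name `redundant_subclass_edges` and the toy class labels are hypothetical.

```python
def redundant_subclass_edges(edges):
    """Return (child, parent) edges whose parent is also reachable via another path.

    Assumption: an edge counts as redundant if removing it leaves the
    parent reachable from the child. The paper may define this differently.
    """
    parents = {}
    for child, parent in edges:
        parents.setdefault(child, set()).add(parent)

    def reachable(start, target, skip_edge):
        # Depth-first search upward through the hierarchy, ignoring skip_edge.
        stack, seen = [start], {start}
        while stack:
            node = stack.pop()
            for nxt in parents.get(node, ()):
                if (node, nxt) == skip_edge:
                    continue
                if nxt == target:
                    return True
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return False

    return sorted(e for e in set(edges) if reachable(e[0], e[1], e))


# Toy hierarchy with made-up labels (not real Wikidata items).
toy_edges = [
    ("house cat", "cat"),
    ("cat", "mammal"),
    ("mammal", "animal"),
    ("house cat", "mammal"),  # already implied by house cat -> cat -> mammal
]
print(redundant_subclass_edges(toy_edges))
```

A purely structural check like this cannot tell whether a flagged edge is harmful. That matches the summary's point that some redundant edges may be worth preserving, so a check of this kind would only supply candidates for a separate risk assessment.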

Paper Structure

This paper contains 30 sections, 3 equations, 7 figures, 1 table.

Figures (7)

  • Figure 1: Error Types in P31/P279 Relations
  • Figure 2: Four evaluation dimensions for entity-level semantic risk.
  • Figure 3: Heatmap of entity counts distributed by adjusted drift score and parent group
  • Figure 4: Stage one result
  • Figure 5: Clean classes with their percentage of instances
  • ...and 2 more figures