Refining Wikidata Taxonomy using Large Language Models
Yiwen Peng, Thomas Bonald, Mehwish Alam
TL;DR
This paper tackles the problem of Wikidata's messy taxonomy by introducing WiKC, a refined taxonomy built by an automated pipeline that combines zero-shot prompting of an open-source LLM with graph mining to prune and restructure the class hierarchy. The pipeline extracts a grounded taxonomy from Wikidata, filters it to a tractable acyclic graph, and then applies a sequence of LLM-driven refinement operations (Cut, Resolve, Merge, Rewire, Filter) to produce a compact, corrected taxonomy. Evaluations show that WiKC shrinks the taxonomy from millions of edges to roughly 17k classes and 20k links, while achieving higher label coverage and substantially better entity-typing accuracy across depths than the original Wikidata taxonomy. The work offers a practical, reproducible approach with a public WiKC-to-Wikidata mapping, and names future directions that include testing other open LLMs and addressing trust in automated taxonomy cleaning.
Abstract
Due to its collaborative nature, Wikidata is known to have a complex taxonomy, with recurrent issues like the ambiguity between instances and classes, the inaccuracy of some taxonomic paths, the presence of cycles, and the high level of redundancy across classes. Manual efforts to clean up this taxonomy are time-consuming and prone to errors or subjective decisions. We present WiKC, a new version of the Wikidata taxonomy cleaned automatically using a combination of Large Language Models (LLMs) and graph mining techniques. Operations on the taxonomy, such as cutting links or merging classes, are performed with the help of zero-shot prompting on an open-source LLM. The quality of the refined taxonomy is evaluated from both an intrinsic and an extrinsic perspective, the latter on an entity typing task, showing the practical interest of WiKC.
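To make the operations named in the abstract concrete, the following is a minimal sketch of cutting links, merging classes, and checking for cycles on a toy taxonomy. It is an illustration under stated assumptions, not the authors' implementation: the function names (`build_prompt`, `cut_links`, `merge_classes`, `has_cycle`), the prompt wording, and the toy classes are all hypothetical, and the LLM is replaced by a stub so that the example runs without a model.

```python
from collections import defaultdict

def build_prompt(child, parent):
    # Hypothetical zero-shot prompt; the wording used in the paper is not given here.
    return f"Is '{child}' a subclass of '{parent}'? Answer yes or no."

def cut_links(edges, ask_llm):
    """Keep only the (child, parent) links that the judge accepts."""
    kept = []
    for child, parent in edges:
        answer = ask_llm(build_prompt(child, parent))
        if answer.strip().lower().startswith("yes"):
            kept.append((child, parent))
    return kept

def merge_classes(edges, keep, drop):
    """Merge class `drop` into class `keep`, removing self-loops and duplicates."""
    merged = set()
    for child, parent in edges:
        child = keep if child == drop else child
        parent = keep if parent == drop else parent
        if child != parent:
            merged.add((child, parent))
    return sorted(merged)

def has_cycle(edges):
    """Depth-first search over child -> parent links (recursive, for small graphs)."""
    parents = defaultdict(list)
    for child, parent in edges:
        parents[child].append(parent)
    WHITE, GREY, BLACK = 0, 1, 2
    state = defaultdict(int)  # every node starts WHITE

    def visit(node):
        state[node] = GREY
        for nxt in parents.get(node, []):
            if state[nxt] == GREY:  # back edge: a cycle exists
                return True
            if state[nxt] == WHITE and visit(nxt):
                return True
        state[node] = BLACK
        return False

    return any(state[n] == WHITE and visit(n) for n in list(parents))

def toy_judge(prompt):
    # Stub standing in for an open-source LLM (assumption): rejects any link to 'city'.
    return "no" if "'city'" in prompt else "yes"

# Toy taxonomy with one wrong link (dog -> city) and one redundant class (canine).
edges = [("dog", "mammal"), ("mammal", "animal"), ("dog", "city"), ("canine", "mammal")]
kept = cut_links(edges, toy_judge)
merged = merge_classes(kept, keep="dog", drop="canine")
```

In this sketch the judge decides only which links to cut, and the merge is a plain graph rewrite. In a real pipeline the stub would be replaced by a call to the chosen model, and the decision to merge two classes would presumably also be delegated to it.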
