ParaNames 1.0: Creating an Entity Name Corpus for 400+ Languages using Wikidata

Jonne Sälevä; Constantine Lignos

TL;DR

ParaNames addresses the need for a massively multilingual parallel name resource by harvesting 140 million names for 16.8 million Wikidata entities across 400+ languages. An automated pipeline filters the data, standardizes scripts, maps each entity to a high-level type (PER/LOC/ORG), and aligns names across languages, which enables canonical name translation and gazetteer-based multilingual NER. A character-level Transformer trained on ParaNames translates names across languages, and adding language and entity-type tokens improves its performance. Used as a gazetteer, ParaNames improves NER performance on all 10 languages evaluated, with coverage being a key factor. Released under CC BY 4.0, ParaNames supports multilingual NLP research and practical applications in under-resourced languages, including as a component of future benchmarks and evaluation datasets.

Abstract

We introduce ParaNames, a massively multilingual parallel name resource consisting of 140 million names spanning over 400 languages. Names are provided for 16.8 million entities, and each entity is mapped from a complex type hierarchy to a standard type (PER/LOC/ORG). Using Wikidata as a source, we create the largest resource of this type to date. We describe our approach to filtering and standardizing the data to provide the best quality possible. ParaNames is useful for multilingual language processing, both in defining tasks for name translation/transliteration and as supplementary data for tasks such as named entity recognition and linking. We demonstrate the usefulness of ParaNames on two tasks. First, we perform canonical name translation between English and 17 other languages. Second, we use it as a gazetteer for multilingual named entity recognition, obtaining performance improvements on all 10 languages evaluated.
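
To make the gazetteer use case concrete, the following is a minimal sketch of one common way a name list can feed an NER system: greedy longest-match lookup that turns token spans into BIO-style tags, which a tagger could then consume as extra features. The abstract does not say how the gazetteer is integrated, so the matching strategy, the function name `gazetteer_tags`, and the toy entries below are all assumptions made for illustration, not the authors' method or actual ParaNames records.

```python
from typing import Dict, List, Tuple

# Toy gazetteer: tokenized name -> coarse entity type (PER/LOC/ORG).
# These entries are illustrative only; they are not taken from ParaNames.
GAZETTEER: Dict[Tuple[str, ...], str] = {
    ("Helsinki",): "LOC",
    ("Helsingfors",): "LOC",
    ("United", "Nations"): "ORG",
    ("Tarja", "Halonen"): "PER",
}

# Longest entry, in tokens, bounds how far ahead we need to look.
MAX_LEN = max(len(name) for name in GAZETTEER)


def gazetteer_tags(tokens: List[str]) -> List[str]:
    """Return BIO tags from greedy longest-match gazetteer lookup.

    Hypothetical helper: at each position, try the longest candidate
    span first and fall back to shorter ones; unmatched tokens get "O".
    """
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = False
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            etype = GAZETTEER.get(tuple(tokens[i:i + n]))
            if etype is not None:
                tags[i] = f"B-{etype}"
                for j in range(i + 1, i + n):
                    tags[j] = f"I-{etype}"
                i += n  # skip past the matched span
                matched = True
                break
        if not matched:
            i += 1
    return tags


if __name__ == "__main__":
    sentence = ["Tarja", "Halonen", "visited", "the", "United", "Nations"]
    print(list(zip(sentence, gazetteer_tags(sentence))))
```

Exact string matching like this only fires when a name appears in the list in the same script and spelling, which is one reason coverage matters for gazetteer-based gains.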
