Universal NER: A Gold-Standard Multilingual Named Entity Recognition Benchmark

Stephen Mayhew, Terra Blevins, Shuheng Liu, Marek Šuppa, Hila Gonen, Joseph Marvin Imperial, Börje F. Karlsson, Peiqin Lin, Nikola Ljubešić, LJ Miranda, Barbara Plank, Arij Riabi, Yuval Pinter

TL;DR

Universal NER (UNER) tackles the lack of gold-standard, multilingual NER benchmarks by creating a community-driven, cross-lingually consistent annotation framework aligned with Universal Dependencies. UNER v1 adds 19 datasets across 13 languages, using a simple PER/ORG/LOC tagset and BIO2 tagging to enable standardized evaluation and cross-lingual transfer analyses. Baseline experiments with XLM-R Large reveal strong in-language performance and variable cross-lingual transfer, with European languages transferring more readily than Chinese or North-African varieties, highlighting script and typology challenges. The work demonstrates the value of collaboratively built, openly available, multilingual NER resources and outlines paths for expansion, quality control, and deeper cross-lingual analysis that can broadly impact multilingual NLP research and evaluation.

Abstract

We introduce Universal NER (UNER), an open, community-driven project to develop gold-standard NER benchmarks in many languages. The overarching goal of UNER is to provide high-quality, cross-lingually consistent annotations to facilitate and standardize multilingual NER research. UNER v1 contains 18 datasets annotated with named entities in a cross-lingually consistent schema across 12 diverse languages. In this paper, we detail the dataset creation and composition of UNER; we also provide initial modeling baselines on both in-language and cross-lingual learning settings. We release the data, code, and fitted models to the public.

Paper Structure (35 sections, 5 figures, 6 tables)

Figures (5)

  • Figure 1: Parallel sentences annotated with person (PER) and location (LOC) named entities in English (EN), German (DE), Russian (RU), and Chinese (ZH).
  • Figure 2: Distribution of tags in different UNER training sets. zh_gsdsimp has the same distribution as zh_gsd.
  • Figure 3: Cross-lingual comparison of NER Annotations on top of PUD treebanks. Left: Global distribution of tags for each PUD language. Center: Sentence-level agreement between languages for the number of entities. Right: Sentence-level agreement between languages for the identity of entities.
  • Figure 4: Heatmap of micro F1 scores on test sets with different fine-tuned models. The y-axis indicates the dataset that the model is fine-tuned on, and the x-axis indicates the datasets that the models are evaluated on. Left: Model performance on datasets that contain train, dev, and test splits. The highlighted diagonal cells are the in-dataset results. Center: Model performance on the PUD datasets. Right: Model performance on all other datasets.
  • Figure 5: F1 scores of each UNER test set after fine-tuning XLM-R Large on all training sets.
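
The micro F1 scores reported in Figures 4 and 5 can be illustrated with a minimal sketch. It assumes entity-level scoring over BIO2 tag sequences with the PER/ORG/LOC tagset, where a predicted entity counts as correct only if both its boundaries and its type match the gold annotation. The function names `bio2_spans` and `micro_f1` are hypothetical, and the handling of ill-formed tag sequences is an assumption; the paper's own evaluation code may differ.

```python
from typing import List, Set, Tuple

# (start index, exclusive end index, entity type such as "PER")
Span = Tuple[int, int, str]


def bio2_spans(tags: List[str]) -> Set[Span]:
    """Extract entity spans from one sentence of BIO2 tags.

    Assumption: an I- tag that does not continue an open span of the
    same type is ignored rather than starting a new entity.
    """
    spans: Set[Span] = set()
    start, etype = None, None
    # A trailing "O" sentinel closes any span still open at sentence end.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


def micro_f1(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Micro-averaged F1: pool entity counts over all sentences first."""
    tp = n_gold = n_pred = 0
    for g, p in zip(gold, pred):
        gold_spans, pred_spans = bio2_spans(g), bio2_spans(p)
        tp += len(gold_spans & pred_spans)
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)


# Toy example: the person span matches, the location is mistyped as ORG.
gold = [["B-PER", "I-PER", "O", "B-LOC"]]
pred = [["B-PER", "I-PER", "O", "B-ORG"]]
print(micro_f1(gold, pred))  # precision = recall = 1/2, so F1 = 0.5
```

Micro-averaging pools true positives across all sentences and entity types before computing precision and recall, so frequent entity types weigh more heavily than rare ones.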