OpenNER 1.0: Standardized Open-Access Named Entity Recognition Datasets in 50+ Languages

Chester Palen-Michel, Maxwell Pickering, Maya Kruse, Jonne Sälevä, Constantine Lignos

TL;DR

OpenNER 1.0 provides the first large-scale, openly accessible, multilingual NER benchmark by standardizing 36 corpora across 52 languages into a uniform representation with a core-ontology option. It details a rigorous data-selection and standardization pipeline (CoNLL formatting, BIO encoding, label repair, and cross-dataset type mappings) and presents baseline results from three pretrained multilingual encoders and two LLMs, revealing no universal winner and highlighting the promise and current limits of multilingual transfer and LLM-based NER. The work delivers a reproducible, extensible resource and a framework for evaluating cross-lingual and multi-ontology NER, with clear pathways for future expansion and methodological improvement. This resource is poised to accelerate multilingual NER research and cross-dataset evaluation by providing high-quality, uniformly formatted data and comprehensive baselines.

Abstract

We present OpenNER 1.0, a standardized collection of openly available named entity recognition (NER) datasets. OpenNER contains 36 NER corpora that span 52 languages, human-annotated in varying named entity ontologies. We correct annotation format issues, standardize the original datasets into a uniform representation with consistent entity type names across corpora, and provide the collection in a structure that enables research in multilingual and multi-ontology NER. We provide baseline results using three pretrained multilingual language models and two large language models to compare the performance of recent models and facilitate future research in NER. We find that no single model is best in all languages and that significant work remains to obtain high performance from LLMs on the NER task. OpenNER is released at https://github.com/bltlab/open-ner.
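
The standardization steps named above (BIO encoding, label repair, and mapping dataset-specific type names onto shared ones) can be illustrated with a minimal Python sketch. It is not the OpenNER implementation: the repair rule (an `I-` tag that does not continue an entity of the same type becomes `B-`), the function names `repair_bio` and `to_core`, and the contents of `CORE_TYPE_MAP` are all assumptions chosen for illustration. Only the three core types (location, organization, person) come from the paper.

```python
from typing import Dict, List

# Hypothetical mapping from dataset-specific type names to shared core names.
# The actual OpenNER mappings may differ; this table is for illustration only.
CORE_TYPE_MAP: Dict[str, str] = {
    "PERSON": "PER", "PER": "PER",
    "LOCATION": "LOC", "LOC": "LOC",
    "ORGANIZATION": "ORG", "ORG": "ORG",
}


def repair_bio(tags: List[str]) -> List[str]:
    """Return a valid BIO sequence.

    Assumed rule: an I-X tag that does not directly follow a B-X or I-X tag
    of the same type is rewritten as B-X.
    """
    repaired: List[str] = []
    prev_type = None  # entity type of the previous token, None if it was O
    for tag in tags:
        if tag == "O":
            repaired.append(tag)
            prev_type = None
            continue
        prefix, _, etype = tag.partition("-")
        if prefix == "I" and prev_type != etype:
            prefix = "B"  # an entity cannot start with I-
        repaired.append(f"{prefix}-{etype}")
        prev_type = etype
    return repaired


def to_core(tags: List[str]) -> List[str]:
    """Rename types via CORE_TYPE_MAP and turn unmapped types into O."""
    out: List[str] = []
    for tag in tags:
        if tag == "O":
            out.append(tag)
            continue
        prefix, _, etype = tag.partition("-")
        core = CORE_TYPE_MAP.get(etype)
        out.append(f"{prefix}-{core}" if core else "O")
    return out


if __name__ == "__main__":
    raw = ["O", "I-PERSON", "I-PERSON", "B-MISC", "I-LOCATION"]
    # Map types first, then repair, so tags orphaned by dropped types are fixed.
    print(repair_bio(to_core(raw)))
```

Running the type mapping before the repair step is a design choice in this sketch: dropping a non-core entity can leave a dangling `I-` tag behind, and the repair pass then restores a valid sequence.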

Paper Structure

This paper contains 32 sections, 5 figures, and 11 tables.

Figures (5)

  • Figure 1: The processing pipeline for OpenNER. Existing datasets (magenta) pass through a series of stages of standardization (blue) to produce two final versions of the dataset (green).
  • Figure 2: Mean F1 for each dataset-language combination, using all entity types present in each dataset. Models were fine-tuned individually on each dataset-language combination.
  • Figure 3: Mean F1 for each dataset-language combination, using only core entity types (location, organization, and person). Models were fine-tuned individually on each dataset-language combination.
  • Figure 4: Mean F1 for each dataset-language combination, using only core entity types (location, organization, and person). Multilingual models were fine-tuned using all datasets and languages.
  • Figure 5: Violin plot of F1 distributions per model, with points depicting mean F1 scores across random seeds for each language-dataset combination. White lines indicate means of all points per-model.
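
The aggregation described in the captions (a mean F1 across random seeds for each language-dataset combination, then a mean of those points per model) can be sketched as follows. This is an illustrative sketch, not the authors' evaluation code: the function names and the tuple layout of a run record are assumptions, and the example values are placeholders rather than results from the paper.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple

# Assumed record layout: (model, dataset, language, seed, f1).
Run = Tuple[str, str, str, int, float]
Key = Tuple[str, str, str]


def mean_f1_per_combination(runs: Iterable[Run]) -> Dict[Key, float]:
    """Average F1 over seeds for each (model, dataset, language) combination."""
    by_key: Dict[Key, List[float]] = defaultdict(list)
    for model, dataset, language, _seed, f1 in runs:
        by_key[(model, dataset, language)].append(f1)
    return {key: mean(values) for key, values in by_key.items()}


def mean_f1_per_model(points: Dict[Key, float]) -> Dict[str, float]:
    """Average the per-combination means for each model."""
    by_model: Dict[str, List[float]] = defaultdict(list)
    for (model, _dataset, _language), f1 in points.items():
        by_model[model].append(f1)
    return {model: mean(values) for model, values in by_model.items()}


if __name__ == "__main__":
    # Placeholder scores for illustration only; not taken from the paper.
    runs = [
        ("model_a", "dataset_x", "en", 1, 0.80),
        ("model_a", "dataset_x", "en", 2, 0.90),
        ("model_a", "dataset_x", "de", 1, 0.50),
    ]
    points = mean_f1_per_combination(runs)
    print(points)
    print(mean_f1_per_model(points))
```

Averaging per combination first gives every language-dataset pair equal weight in the per-model mean, regardless of how many seeds were run for it.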