HistNERo: Historical Named Entity Recognition for the Romanian Language

Andrei-Marius Avram; Andreea Iuga; George-Vlad Manolache; Vlad-Cristian Matei; Răzvan-Gabriel Micliuş; Vlad-Andrei Muntean; Manuel-Petru Sorlescu; Dragoş-Andrei Şerban; Adrian-Dinu Urse; Vasile Păiş; Dumitru-Clementin Cercel

TL;DR

HistNERo is the first named entity recognition (NER) resource for historical Romanian, spanning 1817–1990 with 323,865 tokens drawn from four historical regions and annotated with five entity types. The paper evaluates multiple Romanian pre-trained language models and introduces a simple loss-reversal domain adaptation technique for learning region-invariant features, reaching a strict F1 of 66.80% (RoBERT-large with loss reversal) and an overall accuracy of 97.64%. The dataset construction, annotation protocol, and TF-IDF analyses illuminate regional linguistic variation, while the RELATE platform supports reproducible annotation work. The resource supports historical Romanian NLP and establishes a baseline for cross-region NER in morphologically rich historical corpora, with potential integration into the LiRo benchmark.

Abstract

This work introduces HistNERo, the first Romanian corpus for Named Entity Recognition (NER) in historical newspapers. The dataset contains 323k tokens of text, spanning from the early 19th century (1817) to the late 20th century (1990). Eight native Romanian speakers annotated the dataset with five named entity types. Each sample belongs to one of four historical regions of Romania: Bessarabia, Moldavia, Transylvania, and Wallachia. We used the dataset to run several NER experiments with Romanian pre-trained language models. Our results show that the best model achieved a strict F1-score of 55.69%. By reducing the discrepancies between regions through a novel domain adaptation technique, we improved performance on this corpus to a strict F1-score of 66.80%, an absolute gain of more than 10 percentage points.
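Since the results are reported as strict F1-scores, it may help to see what a strict (exact-match) entity-level F1 typically computes. The following is a minimal sketch, not the authors' evaluation code: it assumes BIO-tagged sequences, counts a predicted entity as correct only when both its boundaries and its type match the gold annotation, and uses illustrative labels (`PER`, `LOC`) and hypothetical function names (`bio_to_spans`, `strict_f1`) that do not come from the paper.

```python
from typing import List, Set, Tuple

# (start index, end index exclusive, entity type); illustrative representation
Span = Tuple[int, int, str]


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Convert one BIO tag sequence into a set of typed entity spans."""
    spans: Set[Span] = set()
    start, etype = None, None
    # A trailing "O" sentinel closes any entity that runs to the end.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and etype == tag[2:]
        if continues:
            continue
        if etype is not None:
            spans.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
        # A stray "I-" with no open entity of that type is ignored here
        # (an assumption; other conventions exist).
    return spans


def strict_f1(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Micro-averaged F1 where a match requires identical span and type."""
    tp = fp = fn = 0
    for gold_tags, pred_tags in zip(gold, pred):
        g, p = bio_to_spans(gold_tags), bio_to_spans(pred_tags)
        tp += len(g & p)
        fp += len(p - g)
        fn += len(g - p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy example: the person entity is truncated by one token, so it does not count.
gold = [["B-PER", "I-PER", "O", "B-LOC"]]
pred = [["B-PER", "O", "O", "B-LOC"]]
score = strict_f1(gold, pred)
```

In the toy example only the location span matches exactly, so precision and recall are both 0.5. A partially overlapping prediction earns no credit under the strict criterion, which is one reason strict scores on noisy historical text tend to be lower than token-level accuracy.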

Paper Structure (19 sections, 4 equations, 3 figures, 7 tables)

Introduction
Related Work
Named Entities in Historical Documents
Romanian Named Entity Recognition
HistNERo Corpus
RELATE Platform
Annotation Process
Data Preprocessing
HistNERo Statistics
Dataset Comparison
TF-IDF-based Data Analysis
Method
Baseline Models
Domain Adaptation
Implementation Details
...and 4 more sections

Figures (3)

Figure 1: The loss reversal algorithm applied to the third token of a tokenized sentence. We note that FF stands for a feed-forward layer.
Figure 2: Inter-regional strict F1-scores of the RoBERT-large model.
Figure 3: Clustering of the HistNERo test set embeddings produced by the BERT-large model using t-SNE with and without loss reversal.
