HistNERo: Historical Named Entity Recognition for the Romanian Language
Andrei-Marius Avram, Andreea Iuga, George-Vlad Manolache, Vlad-Cristian Matei, Răzvan-Gabriel Micliuş, Vlad-Andrei Muntean, Manuel-Petru Sorlescu, Dragoş-Andrei Şerban, Adrian-Dinu Urse, Vasile Păiş, Dumitru-Clementin Cercel
TL;DR
HistNERo is the first historical Romanian NER resource, spanning 1817–1990 with 323,865 tokens across four historical regions and five entity types. The work evaluates multiple Romanian pretrained LMs and introduces a simple loss-reversal domain adaptation technique to learn region-invariant features, achieving a strict F1 of $66.80\%$ (RoBERT-large with loss reversal) and high overall accuracy ($97.64\%$). The dataset construction, annotation protocol, and TF-IDF analyses illuminate regional linguistic variation, while the RELATE platform supports reproducible annotation work. The resource enables robust historical Romanian NLP and establishes a baseline for cross-region NER in morphologically rich historical corpora, with potential integration into the LiRo benchmark.
Abstract
This work introduces HistNERo, the first Romanian corpus for Named Entity Recognition (NER) in historical newspapers. The dataset contains 323k tokens of text, spanning from the early 19th century (1817) to the late 20th century (1990). Eight native Romanian speakers annotated the dataset with five named entity types. Each sample belongs to one of four historical regions of Romania: Bessarabia, Moldavia, Transylvania, or Wallachia. We used the dataset to perform several NER experiments with Romanian pre-trained language models. Our results show that the best model achieved a strict F1-score of 55.69%. Furthermore, by reducing the discrepancies between regions through a novel domain adaptation technique, we improved the performance on this corpus to a strict F1-score of 66.80%, an absolute gain of more than 10 percentage points.
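
The strict F1-score reported above is usually understood as an entity-level metric in which a prediction counts as correct only when both its span boundaries and its type match a gold entity exactly. The following is a minimal sketch of that computation, assuming BIO-encoded tags; the helper names `extract_entities` and `strict_f1` and the `PER`/`LOC` labels are illustrative assumptions, not the evaluation code or tag set used by the authors.

```python
from typing import List, Set, Tuple

# (start token index, end token index exclusive, entity type)
Entity = Tuple[int, int, str]


def extract_entities(tags: List[str]) -> Set[Entity]:
    """Collect (start, end, type) spans from one BIO tag sequence.

    Assumption: an I- tag that does not continue a span of the same type
    is ignored rather than treated as the start of a new entity.
    """
    entities: Set[Entity] = set()
    start, etype = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" closes a trailing span
        continues_span = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues_span:
            entities.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return entities


def strict_f1(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Micro-averaged F1 over exact (span, type) matches."""
    tp = n_gold = n_pred = 0
    for gold_tags, pred_tags in zip(gold, pred):
        g, p = extract_entities(gold_tags), extract_entities(pred_tags)
        tp += len(g & p)  # exact boundary and type match only
        n_gold += len(g)
        n_pred += len(p)
    if tp == 0:
        return 0.0
    precision = tp / n_pred
    recall = tp / n_gold
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    gold = [["B-PER", "I-PER", "O", "B-LOC"]]
    pred = [["B-PER", "O", "O", "B-LOC"]]  # PER span is truncated, so it does not count
    print(f"strict F1 = {strict_f1(gold, pred):.2f}")
```

In the usage example, the truncated person span earns no credit under strict matching, so only one of the two gold entities is recovered. This is why strict scores tend to be lower than token-level accuracy on the same predictions.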
