Fine-Grained Named Entities for Corona News
Sefika Efeoglu, Adrian Paschke
TL;DR
This paper addresses the need for up-to-date named entity recognition on corona news by building a specialized annotation pipeline that produces a 23-type corpus from Tagesschau articles. The pipeline uses gold-domain seeds and Wikidata-derived silver seeds to produce parallel health and generic entity annotations, harmonizes them in favor of domain-specific health entities, and trains NER models using Flair and SciBERT. Evaluation on expert-labeled test data shows that contextual embeddings yield higher micro-F1 scores than word-only baselines, with SciBERT performing well on domain-specific types. The resulting corpus and modeling approach enable improved extraction of corona-related mentions from news, supporting downstream analysis of pandemic dynamics and information flow. Future work includes spelling correction prior to translation to further improve data quality.
Abstract
Since December 2019, information sources such as newspapers have produced unstructured text about the corona outbreak in various languages. Analyzing these texts is time-consuming unless they are first represented in a structured format. An information extraction pipeline built on two essential tasks, named entity tagging and relation extraction, can produce such a representation. This study proposes a data annotation pipeline that generates training data from corona news articles, covering both generic and domain-specific entities. Named entity recognition models are trained on this annotated corpus and then evaluated on test sentences manually annotated by domain experts. The code base and demonstration are available at https://github.com/sefeoglu/coronanews-ner.git.
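As a rough illustration of how generic and domain-specific entity annotations for one sentence might be combined so that domain-specific entities take precedence, consider the minimal sketch below. It is not taken from the linked code base: the `Span` type, the `harmonize` function, the token-offset convention, and the labels `DISEASE`, `ORG`, `MISC`, and `LOC` are all hypothetical names chosen for the example.

```python
from typing import List, NamedTuple


class Span(NamedTuple):
    """A labeled entity span over token indices (hypothetical representation)."""
    start: int  # index of the first token, inclusive
    end: int    # index one past the last token, exclusive
    label: str


def overlaps(a: Span, b: Span) -> bool:
    """Return True if the two spans share at least one token."""
    return a.start < b.end and b.start < a.end


def harmonize(health: List[Span], generic: List[Span]) -> List[Span]:
    """Merge two annotation layers, preferring health spans on conflict.

    Assumption for this sketch: a generic span is dropped whenever it
    overlaps any health span; all other spans are kept unchanged.
    """
    merged = list(health)
    for g in generic:
        if not any(overlaps(g, h) for h in health):
            merged.append(g)
    return sorted(merged, key=lambda s: (s.start, s.end))


if __name__ == "__main__":
    tokens = ["The", "Robert", "Koch", "Institute", "reported",
              "new", "COVID-19", "cases", "in", "Berlin"]
    health = [Span(6, 7, "DISEASE")]        # hypothetical health label
    generic = [Span(1, 4, "ORG"),           # hypothetical generic labels
               Span(6, 8, "MISC"),          # conflicts with the health span
               Span(9, 10, "LOC")]
    for span in harmonize(health, generic):
        print(" ".join(tokens[span.start:span.end]), span.label)
```

In this example the generic `MISC` span overlaps the health span and is discarded, while the non-conflicting `ORG` and `LOC` spans are kept. A real pipeline could resolve conflicts differently, for instance by keeping the longer span or by type-specific rules.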
