Guided Distant Supervision for Multilingual Relation Extraction Data: Adapting to a New Language
Alistair Plum, Tharindu Ranasinghe, Christoph Purschke
TL;DR
This work adapts guided distant supervision (GDS) to German to enable large-scale biographical relation extraction with limited manual annotation. By leveraging Pantheon, Wikidata, and German Wikipedia, the authors build a German RE dataset with over 80,000 instances across nine relations and provide a 2,000-sentence gold standard for evaluation, along with pretrained models. They systematically evaluate monolingual, cross-lingual, and multilingual transfer using state-of-the-art transformer models (BERT, XLM-R), demonstrating that cross-lingual and multilingual training can approach or match monolingual performance and are viable for low-resource languages. The study highlights practical benefits for Digital Humanities and language-resource development, and releases resources to the community for further exploration of multilingual biographical RE.
Abstract
Relation extraction is essential for extracting and understanding biographical information in the context of digital humanities and related subjects. There is growing interest in the community in building datasets that can be used to train machine learning models to extract relations. However, annotating such datasets can be expensive and time-consuming, and existing datasets have been limited to English. This paper applies guided distant supervision to create a large biographical relation extraction dataset for German. Our dataset, composed of more than 80,000 instances for nine relation types, is the largest biographical German relation extraction dataset. We also create a manually annotated dataset with 2,000 instances to evaluate the models and release it together with the dataset compiled using guided distant supervision. We train several state-of-the-art machine learning models on the automatically created dataset and release them as well. Furthermore, we run multilingual and cross-lingual experiments that could benefit many low-resource languages.
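To make the core idea concrete, the following is a minimal sketch of plain distant supervision, assuming a small in-memory list of known facts and simple substring matching. It is not the authors' pipeline: the `Fact` class, the `label_sentences` function, the relation name `birthplace`, and the example sentences are all hypothetical, and the "guided" step that cross-checks candidates against external knowledge is omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    """A known (subject, relation, object) triple, e.g. from a knowledge base."""
    subject: str
    relation: str
    obj: str


def label_sentences(sentences, facts):
    """Label each sentence that mentions both arguments of a known fact.

    Hypothetical helper: uses case-insensitive substring matching only,
    which is far cruder than a real entity-linking step.
    """
    labelled = []
    for sentence in sentences:
        lowered = sentence.lower()
        for fact in facts:
            if fact.subject.lower() in lowered and fact.obj.lower() in lowered:
                labelled.append((sentence, fact.subject, fact.relation, fact.obj))
    return labelled


# Illustrative inputs (assumed, not taken from the released dataset).
facts = [Fact("Albert Einstein", "birthplace", "Ulm")]
sentences = [
    "Albert Einstein wurde 1879 in Ulm geboren.",
    "Albert Einstein erhielt 1921 den Nobelpreis für Physik.",
]

examples = label_sentences(sentences, facts)
for sentence, subject, relation, obj in examples:
    print(f"{relation}({subject}, {obj}): {sentence}")
```

Only the first sentence is labelled, because the second mentions the subject but not the object of the known fact. Matching on co-occurrence alone produces noisy labels, which is the weakness that a guided variant is meant to reduce.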
