Distantly Supervised Morpho-Syntactic Model for Relation Extraction

Nicolas Gutehrlé, Iana Atanassova

TL;DR

This work tackles scalable relation extraction by using distantly supervised morpho-syntactic patterns harvested from Wikidata and Wikipedia to build a Syntactic Index and a TF-IDF-based Semantic Index. It extracts candidate dependency-graph fragments along the Shortest Dependency Path (SDP) between entity designations and classifies them into relations using semantic scores, enabling language-agnostic extraction with minimal labeling. Across six Wikidata/Wikipedia-derived datasets, the method achieves high precision (up to ~0.85) but lower recall and F1, highlighting a trade-off between pattern coverage and labeling noise. The approach enables rapid creation of rule-based IE systems and large weakly labeled datasets for training ML or deep-learning models, with prospects for broader language coverage and improvement through richer pattern catalogs and NER preprocessing.

Abstract

The task of Information Extraction (IE) involves automatically converting unstructured textual content into structured data. Most research in this field concentrates on extracting all facts or a specific set of relationships from documents. In this paper, we present a method for the extraction and categorisation of an unrestricted set of relationships from text. Our method relies on morpho-syntactic extraction patterns obtained by a distant supervision method, and creates Syntactic and Semantic Indices to extract and classify candidate graphs. We evaluate our approach on six datasets built on Wikidata and Wikipedia. The evaluation shows that our approach can achieve Precision scores of up to 0.85, but with lower Recall and F1 scores. Our approach makes it possible to quickly create rule-based systems for Information Extraction and to build annotated datasets for training classifiers based on machine learning and deep learning.
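
To make the idea of a TF-IDF-based Semantic Index more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each relation can be treated as one "document" made of the anchor lemmas of its collected patterns, and that a candidate is assigned to the relation with the highest weight for its anchor lemma. The relation labels, the toy lemma lists, and the function names `build_semantic_index` and `classify` are all hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy data: for each relation label, the anchor lemmas of the
# patterns collected for it. These values are invented for illustration.
PATTERN_LEMMAS = {
    "place_of_birth": ["naître", "naître", "naître", "originaire"],
    "place_of_death": ["mourir", "mourir", "décéder"],
    "residence": ["vivre", "habiter", "naître"],
}


def build_semantic_index(pattern_lemmas):
    """Map each lemma to {relation: TF-IDF weight}, one 'document' per relation."""
    n_relations = len(pattern_lemmas)

    # Document frequency: in how many relations does each lemma occur?
    doc_freq = Counter()
    for lemmas in pattern_lemmas.values():
        doc_freq.update(set(lemmas))

    index = {}
    for relation, lemmas in pattern_lemmas.items():
        counts = Counter(lemmas)
        for lemma, count in counts.items():
            tf = count / len(lemmas)
            # Assumed smoothing (+1) so lemmas shared by all relations keep a weight.
            idf = math.log(n_relations / doc_freq[lemma]) + 1.0
            index.setdefault(lemma, {})[relation] = tf * idf
    return index


def classify(index, anchor_lemma):
    """Return the best-scoring relation for a lemma, or None if it is unknown."""
    scores = index.get(anchor_lemma)
    if not scores:
        return None
    return max(scores, key=scores.get)


index = build_semantic_index(PATTERN_LEMMAS)
print(classify(index, "naître"))
print(classify(index, "chanter"))
```

In this toy setting an ambiguous lemma such as "naître" is resolved in favour of the relation where it is proportionally most frequent, while a lemma absent from the index yields no relation. That behaviour is consistent with a high-precision, lower-recall system, although the paper's actual scoring may differ.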

Paper Structure (13 sections, 4 figures, 6 tables)

Figures (4)

  • Figure 1: Pipeline for collecting and processing data
  • Figure 2: SDP subgraph between "Jeanne" and "Domrémy" for the sentence "Jeanne d'Arc est née à Domrémy" (Joan of Arc was born in Domrémy). Each node is represented by its text, its lemma and its part-of-speech. The node labelled "née" (born) is the anchor node.
  • Figure 3: A graph in the "naître_VERB" (born) entry in the Syntactic Index
  • Figure 4: Classification pipeline using the Syntactic and Semantic Indices
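
The Shortest Dependency Path shown in Figure 2 can be illustrated with a small sketch. The dependency edges below are written by hand for the example sentence and are an assumption, not the output of the parser used in the paper; the function name `shortest_dependency_path` is likewise hypothetical. The sketch treats the dependency tree as an undirected graph and runs a breadth-first search between the two entity tokens.

```python
from collections import deque

# Hand-written (head, dependent, label) edges for
# "Jeanne d'Arc est née à Domrémy". Illustrative assumption only.
# Tokens are used as node ids, which only works when every token is unique.
EDGES = [
    ("née", "Jeanne", "nsubj:pass"),
    ("Jeanne", "Arc", "nmod"),
    ("Arc", "d'", "case"),
    ("née", "est", "aux:pass"),
    ("née", "Domrémy", "obl"),
    ("Domrémy", "à", "case"),
]


def shortest_dependency_path(edges, source, target):
    """Return the nodes on the shortest undirected path, or None if unreachable."""
    neighbours = {}
    for head, dependent, _label in edges:
        neighbours.setdefault(head, []).append(dependent)
        neighbours.setdefault(dependent, []).append(head)

    previous = {source: None}  # doubles as the visited set
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk back through the predecessors to rebuild the path.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for nxt in neighbours.get(node, []):
            if nxt not in previous:
                previous[nxt] = node
                queue.append(nxt)
    return None


path = shortest_dependency_path(EDGES, "Jeanne", "Domrémy")
print(path)  # ['Jeanne', 'née', 'Domrémy']
```

With these edges, the path runs through "née", which matches the anchor node described in the Figure 2 caption. Function words such as "est" and "à" fall outside the path.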