Automated Knowledge Graph Construction using Large Language Models and Sentence Complexity Modelling

Sydney Anuyah, Mehedi Mahmud Kaushik, Krishna Dwarampudi, Rakesh Shiradkar, Arjan Durresi, Sunandan Chakraborty

TL;DR

CoDe-KG addresses automated knowledge-graph construction from unstructured biomedical text by combining robust coreference resolution with syntactic sentence decomposition. The paper introduces an open-source, end-to-end pipeline and substantial resources, including a dataset of over 150,000 knowledge triples and datasets for sentence complexity and coreference. It demonstrates that hybrid chain-of-thought and few-shot prompting (CoT+FICL) yields up to 99.8% exact-match accuracy on sentence simplification, and it reports relation-extraction scores of 65.8% macro-F1 on REBEL and 75.7% micro-F1 on WebNLG2. Ablations show that coreference resolution and decomposition boost recall on rare relations by over 20%. Overall, the work argues for an open-source, modular approach to knowledge extraction that scales with domain-specific data and prompting strategies.

Abstract

We introduce CoDe-KG, an open-source, end-to-end pipeline for extracting sentence-level knowledge graphs by combining robust coreference resolution with syntactic sentence decomposition. Using our model, we contribute an open-source dataset of over 150,000 knowledge triples. We also contribute a training corpus of 7,248 rows for sentence complexity, 190 rows of gold human annotations for coreference resolution using open-source lung-cancer abstracts from PubMed, 900 rows of gold human annotations for sentence conversion policies, and 398 triples of gold human annotations. We systematically select optimal prompt-model pairs across five complexity categories, showing that hybrid chain-of-thought and few-shot prompting yields up to 99.8% exact-match accuracy on sentence simplification. On relation extraction (RE), our pipeline achieves 65.8% macro-F1 on REBEL, an 8-point gain over the prior state of the art, and 75.7% micro-F1 on WebNLG2, while matching or exceeding performance on Wiki-NRE and CaRB. Ablation studies demonstrate that integrating coreference and decomposition increases recall on rare relations by over 20%. Code and dataset are available at https://github.com/KaushikMahmud/CoDe-KG_EMNLP_2025
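The abstract reports macro-F1 on REBEL and micro-F1 on WebNLG2, and the two averages behave differently when some relations are rare. The following is a minimal sketch of the distinction, assuming a simplified single-label setting in which each instance has one gold and one predicted relation label. The function names and the example labels are illustrative assumptions, and the paper's actual evaluation protocol may differ (for example, by matching whole triples).

```python
from collections import Counter


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_micro_f1(gold: list, pred: list) -> tuple:
    """Return (macro_f1, micro_f1) for equal-length lists of relation labels.

    Assumption: one label per instance (a simplification for illustration).
    """
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label was wrong
            fn[g] += 1  # gold label was missed
    # Macro: average the per-label F1 scores, so rare labels weigh as much as frequent ones.
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    # Micro: pool the counts first, so frequent labels dominate.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


# Hypothetical labels: a frequent relation and a rare one that is missed half the time.
gold = ["treats"] * 8 + ["causes"] * 2
pred = ["treats"] * 9 + ["causes"]
macro, micro = macro_micro_f1(gold, pred)
print(f"macro-F1={macro:.3f}, micro-F1={micro:.3f}")
```

In this toy case the macro score is lower than the micro score, because the missed rare relation counts as much as the frequent one. This is why recall on rare relations matters more for a macro-averaged benchmark figure.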

Paper Structure

This paper contains 39 sections, 40 equations, 3 figures, 12 tables, 5 algorithms.

Figures (3)

  • Figure 1: Overview of CoDe-KG, the automated KG creation pipeline. First, the input set of abstracts is given to the Coreference Resolution stage. In this phase, a team of annotators, a collection of prompt strategies, and models are jointly applied to produce the coreference-resolved abstract set, which is given as input to the Sentence Classification stage. With the help of verifiers, prompting strategies, and models, a list of correctly classified sentences with labels is generated in this stage. Then, in the Converting Sentences to Simple stage, the sentences $\tilde{S}_{{comx},\; {comp},\; {comx\_comp}}$, prompt strategies, and models are given as input, and the sentences are converted into simple sentences $\tilde{S}_{simp}$. In the Relationship Extraction stage, $\tilde{S}_{simp}$, $S_{\mathrm{init}}$, and the best model--prompt pair $(P^*,M^*)$ from the previous stage are given as input, and relationships ($\text{entity}_1$, $\text{relationship}$, $\text{entity}_2$) are extracted for constructing the KG.
  • Figure 2: Distribution of error bucketization categories.
  • Figure 3: Subsection of the Knowledge Graph.
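The four stages described in the Figure 1 caption can be sketched as a chain of pluggable functions. This is only an illustrative skeleton: the class name `Pipeline`, the label strings, and all four toy stand-in functions are assumptions made for this example. In CoDe-KG each stage is driven by a selected model-prompt pair, not by the string rules used here.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Triple = Tuple[str, str, str]  # (entity_1, relationship, entity_2)


@dataclass
class Pipeline:
    """Hypothetical skeleton of the four stages; each stage is a plain callable."""
    resolve_coref: Callable[[str], str]        # abstract -> coreference-resolved abstract
    classify: Callable[[str], str]             # sentence -> complexity label
    simplify: Callable[[str], List[str]]       # non-simple sentence -> simple sentences
    extract: Callable[[str], List[Triple]]     # simple sentence -> triples

    def run(self, abstract: str) -> List[Triple]:
        resolved = self.resolve_coref(abstract)
        # Naive sentence splitting on periods, for illustration only.
        sentences = [s.strip() for s in resolved.split(".") if s.strip()]
        triples: List[Triple] = []
        for sent in sentences:
            # Sentences already labelled simple skip the conversion stage.
            simple = [sent] if self.classify(sent) == "simple" else self.simplify(sent)
            for s in simple:
                triples.extend(self.extract(s))
        return triples


# Toy stand-ins (assumptions); a real system would call an LLM in each stage.
def toy_coref(text: str) -> str:
    return text.replace("It ", "Gefitinib ").replace(" it ", " Gefitinib ")


def toy_classify(sentence: str) -> str:
    return "non_simple" if " and " in sentence else "simple"


def toy_simplify(sentence: str) -> List[str]:
    return [clause.strip() for clause in sentence.split(" and ")]


def toy_extract(sentence: str) -> List[Triple]:
    parts = sentence.split(maxsplit=2)  # subject, verb, remainder
    return [(parts[0], parts[1], parts[2])] if len(parts) == 3 else []


pipeline = Pipeline(toy_coref, toy_classify, toy_simplify, toy_extract)
text = "Gefitinib inhibits EGFR. It treats lung cancer and it reduces tumor growth."
triples = pipeline.run(text)
for t in triples:
    print(t)
```

Keeping each stage behind a callable mirrors the modular design the paper argues for: a stage can be swapped for a different model or prompt without touching the rest of the chain.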