Automated Knowledge Graph Construction using Large Language Models and Sentence Complexity Modelling

Sydney Anuyah; Mehedi Mahmud Kaushik; Krishna Dwarampudi; Rakesh Shiradkar; Arjan Durresi; Sunandan Chakraborty

TL;DR

CoDe-KG addresses automated knowledge-graph construction from unstructured biomedical text by combining robust coreference resolution with syntactic sentence decomposition. The paper introduces an open-source, end-to-end pipeline and several resources: a dataset of over 150,000 knowledge triples, plus annotated datasets for sentence complexity and coreference resolution. It shows that hybrid prompting, which combines chain-of-thought with few-shot in-context learning (CoT+FICL), reaches up to 99.8% exact-match accuracy on sentence simplification. On relation extraction, the pipeline reports 65.8% macro-F1 on REBEL and 75.7% micro-F1 on WebNLG2. Ablations show that coreference resolution and decomposition raise recall on rare relations by over 20%. Overall, the work argues for an open-source, modular approach to knowledge extraction that scales with domain-specific data and prompting strategies.

Abstract

We introduce CoDe-KG, an open-source, end-to-end pipeline for extracting sentence-level knowledge graphs by combining robust coreference resolution with syntactic sentence decomposition. Using our model, we contribute an open-source dataset of over 150,000 knowledge triples. We also contribute a training corpus of 7,248 rows for sentence complexity, 190 rows of gold human annotations for coreference resolution drawn from open-source lung-cancer abstracts on PubMed, 900 rows of gold human annotations for sentence conversion policies, and 398 triples of gold human annotations. We systematically select optimal prompt-model pairs across five complexity categories, showing that hybrid chain-of-thought and few-shot prompting yields up to 99.8% exact-match accuracy on sentence simplification. On relation extraction (RE), our pipeline achieves 65.8% macro-F1 on REBEL, an 8-point gain over the prior state of the art, and 75.7% micro-F1 on WebNLG2, while matching or exceeding performance on Wiki-NRE and CaRB. Ablation studies demonstrate that integrating coreference and decomposition increases recall on rare relations by over 20%. Code and dataset are available at https://github.com/KaushikMahmud/CoDe-KG_EMNLP_2025
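
The abstract reports macro-F1 on REBEL and micro-F1 on WebNLG2, and the two metrics weight rare relations very differently. The sketch below illustrates that difference on toy triples. It assumes exact matching of (head, relation, tail) tuples; the function names and example triples are hypothetical, and the benchmarks' actual scoring scripts may use different matching rules.

```python
from collections import Counter


def f1(tp, fp, fn):
    """F1 from raw counts; returns 0.0 when precision and recall are both undefined or zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0


def micro_macro_f1(gold, pred):
    """Compare two sets of (head, relation, tail) triples.

    Micro-F1 pools counts over all triples, so frequent relations dominate.
    Macro-F1 averages per-relation F1, so each relation counts equally.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for triple in pred:
        relation = triple[1]
        if triple in gold:
            tp[relation] += 1
        else:
            fp[relation] += 1
    for triple in gold:
        if triple not in pred:
            fn[triple[1]] += 1

    relations = set(tp) | set(fp) | set(fn)
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = (
        sum(f1(tp[r], fp[r], fn[r]) for r in relations) / len(relations)
        if relations
        else 0.0
    )
    return micro, macro


# Toy data (hypothetical): a frequent relation extracted perfectly,
# and a rare relation extracted with the wrong tail entity.
gold = {
    ("drug_a", "treats", "disease_x"),
    ("drug_b", "treats", "disease_y"),
    ("drug_c", "treats", "disease_z"),
    ("gene_q", "causes", "disease_x"),
}
pred = {
    ("drug_a", "treats", "disease_x"),
    ("drug_b", "treats", "disease_y"),
    ("drug_c", "treats", "disease_z"),
    ("gene_q", "causes", "disease_w"),
}

micro, macro = micro_macro_f1(gold, pred)
print(f"micro-F1 = {micro:.2f}, macro-F1 = {macro:.2f}")
```

In this toy case the single miss on the rare relation costs far more under macro-F1 than under micro-F1, which is why recall gains on rare relations tend to show up most clearly in macro-averaged scores.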
