RAKG: Document-level Retrieval Augmented Knowledge Graph Construction

Hairong Zhang, Jiaheng Si, Guohang Yan, Boyuan Qi, Pinlong Cai, Song Mao, Ding Wang, Botian Shi

TL;DR

This work tackles document-level knowledge graph construction (KGC) by introducing RAKG, a framework that uses pre-entities and retrieval-augmented generation (RAG) to build coherent, globally connected knowledge graphs (KGs) from documents. The method comprises document chunking, KG and text vectorization, pre-entity construction with disambiguation, corpus retrospective retrieval, graph structure retrieval, and LLM-assisted relation network generation, followed by LLM-based evaluation of both the extracted entities and the generated relation networks to suppress hallucinations. Key contributions include the integration of RAG-style evaluation into KGC, a progressive extraction approach that uses pre-entities to reduce disambiguation complexity and long-context forgetting, and an evaluation on the MINE dataset showing an accuracy of roughly 96%, above GraphRAG (89.71%) and KGGen (86.48%), along with higher entity coverage and relation similarity. The work demonstrates that document-level KG construction benefits from retrieving multi-source information and validating outputs with an LLM judge, which improves both completeness and semantic fidelity. An open-source implementation is provided for community use.
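The retrieval step in the summary above, where a pre-entity serves as the query against the document's chunks, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the bag-of-words `embed` function is a stand-in for a real embedding model, and the names `embed`, `cosine`, and `retrieve_chunks`, the `top_k` parameter, and the sample chunks are all hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words vector (assumption: stands in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_chunks(pre_entity: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Return up to top_k chunks most similar to the pre-entity, dropping zero-similarity ones."""
    query = embed(pre_entity)
    scored = [(cosine(query, embed(chunk)), chunk) for chunk in chunks]
    # sorted() is stable, so chunks with equal scores keep their document order.
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [chunk for score, chunk in ranked[:top_k] if score > 0.0]


# Hypothetical document chunks, invented for illustration only.
chunks = [
    "Marie Curie was born in Warsaw in 1867.",
    "The Nobel Prize in Physics was awarded in 1903.",
    "Curie later discovered polonium and radium.",
]
hits = retrieve_chunks("Marie Curie", chunks)
for chunk in hits:
    print(chunk)
```

Because each pre-entity pulls in every chunk that mentions it, the LLM that later writes the entity's relations sees evidence from across the whole document rather than a single window, which is the motivation the summary gives for reducing long-context forgetting.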

Abstract

With the rise of knowledge-graph-based retrieval-augmented generation (RAG) techniques such as GraphRAG and Pike-RAG, the role of knowledge graphs in enhancing the reasoning capabilities of large language models (LLMs) has become increasingly prominent. However, traditional Knowledge Graph Construction (KGC) methods face challenges such as complex entity disambiguation, rigid schema definition, and insufficient cross-document knowledge integration. This paper focuses on the task of automatic document-level knowledge graph construction and proposes the Document-level Retrieval Augmented Knowledge Graph Construction (RAKG) framework. RAKG extracts pre-entities from text chunks and uses them as queries for RAG, which addresses long-context forgetting in LLMs and reduces the complexity of coreference resolution. In contrast to conventional KGC methods, RAKG more effectively captures global information and the interconnections among disparate nodes, thereby enhancing the overall performance of the model. Additionally, we transfer the RAG evaluation framework to the KGC field and use it to filter and evaluate the generated knowledge graphs, thereby avoiding incorrect entities and relationships caused by LLM hallucinations. We further developed the MINE dataset by constructing standard knowledge graphs for each article and experimentally validated the performance of RAKG. The results show that RAKG achieves an accuracy of 95.91% on the MINE dataset, a 6.2 percentage point improvement over the current best baseline, GraphRAG (89.71%). The code is available at https://github.com/LMMApplication/RAKG.
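As a quick check on the headline comparison, using only the two accuracies quoted in the abstract, the gap is an absolute difference in percentage points rather than a relative gain:

$$95.91\% - 89.71\% = 6.20 \text{ percentage points}, \qquad \frac{95.91 - 89.71}{89.71} \approx 6.9\% \text{ (relative)}$$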

Paper Structure

This paper contains 29 sections, 14 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: The RAKG framework processes documents through sentence segmentation and vectorization, extracts preliminary entities, and performs entity disambiguation and vectorization. The processed entities undergo Corpus Retrospective Retrieval to obtain relevant texts and Graph Structure Retrieval to obtain the related KG. Subsequently, an LLM is employed to integrate the retrieved information and construct a relation network for each entity; these networks are then merged. Finally, the newly built knowledge graph is combined with the original one.
  • Figure 2: Distribution of SC scores across 105 articles for GraphRAG, KGGen, and RAKG on the MINE dataset. The results demonstrate that RAKG achieves an accuracy of 95.81%, outperforming KGGen (86.48%) and GraphRAG (89.71%).
  • Figure 3: Visualisation of the experimental results, showing the entity density and relation richness of the knowledge graphs generated by RAKG, GraphRAG, and KGGen. RAKG produces denser entities and richer relations than GraphRAG and KGGen.
  • Figure 4: The process of LLM as judge: Extracted entities are checked against the source text to eliminate hallucinations. The retriever uses entities to fetch relevant texts and KG, building a relation network. This network is then verified for consistency with the retrieved information.
  • Figure 5: Results of LLM as judge: The pass rate for entities is around 91.33%, and the pass rate for relation networks is approximately 94.51%.
  • ...and 1 more figures
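
The two-stage check described in the Figure 4 caption, and the pass rates reported in Figure 5, can be sketched as follows. This is a hedged illustration rather than the paper's code: `containment_judge` is a simple string-matching placeholder for what would be an LLM prompt in practice, and every function name and sample string below is hypothetical.

```python
from typing import Callable

Triple = tuple[str, str, str]
Judge = Callable[[str, str], bool]


def containment_judge(claim: str, evidence: str) -> bool:
    """Placeholder judge: passes a claim only if it appears verbatim in the evidence.
    Assumption: a real system would prompt an LLM here instead."""
    return claim.lower() in evidence.lower()


def filter_entities(entities: list[str], source_text: str,
                    judge: Judge = containment_judge) -> list[str]:
    """Stage 1: keep only the entities that the judge finds supported by the source text."""
    return [entity for entity in entities if judge(entity, source_text)]


def filter_triples(triples: list[Triple], retrieved_context: str,
                   judge: Judge = containment_judge) -> list[Triple]:
    """Stage 2: keep a relation only if both endpoints are supported by the retrieved context.
    This is a simplification; it does not check the relation label itself."""
    return [
        (head, relation, tail)
        for head, relation, tail in triples
        if judge(head, retrieved_context) and judge(tail, retrieved_context)
    ]


def pass_rate(kept: int, total: int) -> float:
    """Fraction of candidates that passed the judge."""
    return kept / total if total else 0.0


# Hypothetical inputs, invented for illustration only.
source = "Marie Curie discovered polonium and radium in Paris."
entities = ["Marie Curie", "polonium", "Albert Einstein"]
triples = [
    ("Marie Curie", "discovered", "polonium"),
    ("Marie Curie", "met", "Albert Einstein"),
]

kept_entities = filter_entities(entities, source)
kept_triples = filter_triples(triples, source)
print(kept_entities, pass_rate(len(kept_entities), len(entities)))
print(kept_triples, pass_rate(len(kept_triples), len(triples)))
```

Passing the judge in as a parameter keeps the filtering logic independent of how the judgment is made, so the placeholder can be swapped for an LLM call without changing the surrounding code.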