Distill-SynthKG: Distilling Knowledge Graph Synthesis Workflow for Improved Coverage and Efficiency

Prafulla Kumar Choubey, Xin Su, Man Luo, Xiangyu Peng, Caiming Xiong, Tiep Le, Shachar Rosenman, Vasudev Lal, Phil Mui, Ricky Ho, Phillip Howard, Chien-Sheng Wu

TL;DR

This work tackles inefficiency and information loss in ontology-free KG construction for Retrieval-Augmented Generation by introducing SynthKG, a multi-step KG synthesis workflow, and Distill-SynthKG, a smaller model distilled from SynthKG that generates a KG in a single inference pass. It also develops an evaluation framework based on proxy triplets derived from multi-hop QA datasets, and a graph-based retrieval method (Proposition-Entity Graph Retriever) to improve KG-driven retrieval and QA. Empirical results on MuSiQue, 2WikiMultiHopQA, and HotpotQA show that Distill-SynthKG delivers higher KG quality, retrieval accuracy, and multi-hop QA performance than larger baselines, while the graph-based retriever consistently outperforms KG-based retrieval baselines. The authors publicly release the SynthKG dataset and the Distill-SynthKG model to support further research on scalable, ontology-free KG construction for RAG systems.

Abstract

Knowledge graphs (KGs) generated by large language models (LLMs) are becoming increasingly valuable for Retrieval-Augmented Generation (RAG) applications that require knowledge-intensive reasoning. However, existing KG extraction methods predominantly rely on prompt-based approaches, which are inefficient for processing large-scale corpora. These approaches often suffer from information loss, particularly with long documents, due to the lack of specialized design for KG construction. Additionally, there is a gap in evaluation datasets and methodologies for ontology-free KG construction. To overcome these limitations, we propose SynthKG, a multi-step, document-level ontology-free KG synthesis workflow based on LLMs. By fine-tuning a smaller LLM on the synthesized document-KG pairs, we streamline the multi-step process into a single-step KG generation approach called Distill-SynthKG, substantially reducing the number of LLM inference calls. Furthermore, we re-purpose existing question-answering datasets to establish KG evaluation datasets and introduce new evaluation metrics. Using KGs produced by Distill-SynthKG, we also design a novel graph-based retrieval framework for RAG. Experimental results demonstrate that Distill-SynthKG not only surpasses all baseline models in KG quality -- including models up to eight times larger -- but also consistently excels in retrieval and question-answering tasks. Our proposed graph retrieval framework also outperforms all KG-retrieval methods across multiple benchmark datasets. We release the SynthKG dataset and Distill-SynthKG model publicly to support further research and development.

Paper Structure

This paper contains 49 sections, 13 figures, and 6 tables.

Figures (13)

  • Figure 1: Our SynthKG data synthesis method (left side) is designed to generate high-coverage, ontology-free, document-level KGs. We distill this synthetic data into Distill-SynthKG (right side), which is applied to multiple downstream applications.
  • Figure 2: Our Proposition-Entity Graph Retriever for multi-hop reasoning retrieves semantically similar propositions, uses graph traversal to select those connected through query entities, and then re-ranks the selected propositions using LLMs.
  • Figure 3: Average number of triplets per 100 words for documents of different lengths, showing that SynthKG maintains the triplet density consistently across all document lengths.
  • Figure 4: Ablation study results on different combinations of input context for multi-hop QA.
  • Figure 5: Performance comparison of different KG-based retrieval methods on multi-hop QA.
  • ...and 8 more figures
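
The three stages named in the Figure 2 caption (similarity retrieval, graph traversal from query entities, re-ranking) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the bag-of-words cosine used in place of embedding similarity, the hop limit, and the lexical scorer standing in for LLM re-ranking are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercase word tokens; a crude stand-in for an embedding model."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Bag-of-words cosine similarity between two strings."""
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, query_entities, propositions, top_k=5, max_hops=2, rerank=None):
    """Hypothetical proposition-entity retrieval sketch.

    propositions: list of (text, set_of_entity_names) pairs.
    rerank: optional callable(query, text) -> score; an LLM call would go
            here, but the default is plain lexical similarity (assumption).
    """
    score = rerank or cosine

    # Stage 1: shortlist the propositions most similar to the query.
    order = sorted(
        range(len(propositions)),
        key=lambda i: cosine(query, propositions[i][0]),
        reverse=True,
    )
    shortlist = order[:top_k]

    # Stage 2: breadth-first traversal of the proposition-entity graph,
    # starting from the query entities and expanding through shared entities.
    selected = []
    seen_entities = set(query_entities)
    frontier = set(query_entities)
    for _ in range(max_hops):
        hits = [
            i for i in shortlist
            if i not in selected and propositions[i][1] & frontier
        ]
        if not hits:
            break
        selected.extend(hits)
        new_entities = set().union(*(propositions[i][1] for i in hits)) - seen_entities
        seen_entities |= new_entities
        frontier = new_entities

    # Stage 3: re-rank the connected propositions.
    ranked = sorted(selected, key=lambda i: score(query, propositions[i][0]), reverse=True)
    return [propositions[i][0] for i in ranked]


example_propositions = [
    ("Marie Curie was born in Warsaw.", {"Marie Curie", "Warsaw"}),
    ("Warsaw is the capital of Poland.", {"Warsaw", "Poland"}),
    ("Paris is the capital of France.", {"Paris", "France"}),
]
example_result = retrieve(
    "In which country was Marie Curie born?",
    {"Marie Curie"},
    example_propositions,
)
```

In this toy example the proposition about Paris is dropped because no chain of shared entities links it to the query entity, while the proposition about Warsaw and Poland is reached on the second hop through the shared entity "Warsaw".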