Distill-SynthKG: Distilling Knowledge Graph Synthesis Workflow for Improved Coverage and Efficiency

Prafulla Kumar Choubey; Xin Su; Man Luo; Xiangyu Peng; Caiming Xiong; Tiep Le; Shachar Rosenman; Vasudev Lal; Phil Mui; Ricky Ho; Phillip Howard; Chien-Sheng Wu

TL;DR

This work tackles inefficiencies and information loss in ontology-free KG construction for Retrieval-Augmented Generation by introducing SynthKG, a multi-step KG synthesis workflow, and Distill-SynthKG, a smaller model distilled from SynthKG that generates a KG in a single inference step. It also develops an evaluation framework based on proxy triplets derived from multi-hop QA datasets and a graph-based retrieval method (Proposition-Entity Graph Retriever) to improve KG-driven retrieval and QA. Empirical results across MuSiQue, 2WikiMultiHopQA, and HotpotQA show that Distill-SynthKG delivers higher KG quality, retrieval accuracy, and multi-hop QA performance than larger baselines, while the graph-based retriever consistently outperforms KG-based baselines. The authors publicly release the SynthKG dataset and the Distill-SynthKG model to spur further research in scalable, ontology-free KG construction for RAG systems.

Abstract

Knowledge graphs (KGs) generated by large language models (LLMs) are becoming increasingly valuable for Retrieval-Augmented Generation (RAG) applications that require knowledge-intensive reasoning. However, existing KG extraction methods predominantly rely on prompt-based approaches, which are inefficient for processing large-scale corpora. These approaches often suffer from information loss, particularly with long documents, due to the lack of specialized design for KG construction. Additionally, there is a gap in evaluation datasets and methodologies for ontology-free KG construction. To overcome these limitations, we propose SynthKG, a multi-step, document-level ontology-free KG synthesis workflow based on LLMs. By fine-tuning a smaller LLM on the synthesized document-KG pairs, we streamline the multi-step process into a single-step KG generation approach called Distill-SynthKG, substantially reducing the number of LLM inference calls. Furthermore, we re-purpose existing question-answering datasets to establish KG evaluation datasets and introduce new evaluation metrics. Using KGs produced by Distill-SynthKG, we also design a novel graph-based retrieval framework for RAG. Experimental results demonstrate that Distill-SynthKG not only surpasses all baseline models in KG quality (including models up to eight times larger) but also consistently excels in retrieval and question-answering tasks. Our proposed graph retrieval framework also outperforms all KG-retrieval methods across multiple benchmark datasets. We release the SynthKG dataset and Distill-SynthKG model publicly to support further research and development.
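
To make the idea of evaluating a generated KG against proxy triplets concrete, the following is a minimal sketch of a coverage score. It is not the metric defined in the paper: it assumes triplets are (subject, relation, object) string tuples and counts a proxy triplet as covered only on an exact match after lowercasing and whitespace normalization. The names `normalize` and `triplet_coverage` are hypothetical.

```python
from typing import Iterable, Tuple

# Assumed representation: a triplet is (subject, relation, object).
Triplet = Tuple[str, str, str]


def normalize(triplet: Triplet) -> Triplet:
    """Lowercase each element and collapse runs of whitespace."""
    return tuple(" ".join(part.lower().split()) for part in triplet)


def triplet_coverage(proxy: Iterable[Triplet], extracted: Iterable[Triplet]) -> float:
    """Fraction of proxy triplets found in the extracted KG.

    Simplifying assumption: exact match after normalization. A practical
    evaluation would likely need softer (e.g. semantic) matching.
    """
    extracted_set = {normalize(t) for t in extracted}
    proxy_list = [normalize(t) for t in proxy]
    if not proxy_list:
        return 0.0  # no proxy triplets: define coverage as zero
    hits = sum(1 for t in proxy_list if t in extracted_set)
    return hits / len(proxy_list)
```

Exact matching is a deliberately strict choice here; it keeps the sketch short but would undercount paraphrased relations in an ontology-free KG.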
