GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven Synthetic Data Generation
Zihong Chen, Wanli Jiang, Jinzhe Li, Zhonghang Yuan, Huanjun Kong, Wanli Ouyang, Nanqing Dong
TL;DR
GraphGen is a knowledge-graph-guided data synthesis framework that eases the data bottleneck in supervised fine-tuning (SFT) of LLMs for knowledge-intensive tasks. It constructs a fine-grained knowledge graph (KG) from source text, identifies the model's knowledge blind spots with the expected calibration error (ECE), samples $k$-hop subgraphs to provide coherent context, and applies style-controlled generation to produce diverse QA data of three types: atomic, aggregated, and multi-hop. Under closed-book QA, the approach consistently outperforms five baselines on three domains, with stronger lexical diversity, coherence, and long-tail coverage, and it remains robust across model architectures and scales. The result is a practical, scalable pipeline for improving SFT data quality, and it highlights the value of explicit knowledge structure in synthetic data generation.
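To make the $k$-hop subgraph step concrete, the following is a minimal sketch of neighborhood extraction by breadth-first search. It is an illustration under stated assumptions, not GraphGen's implementation: the function name `k_hop_subgraph`, the `(head, relation, tail)` triple format, and the toy entities are all hypothetical, and edges are treated as undirected for traversal.

```python
from collections import deque


def k_hop_subgraph(edges, seed, k):
    """Return (nodes, edges) of the subgraph within k hops of `seed`.

    `edges` is a list of (head, relation, tail) triples. The triple format
    and this function are illustrative assumptions, not GraphGen's API.
    """
    # Build an undirected adjacency map so traversal ignores edge direction.
    adjacency = {}
    for head, _relation, tail in edges:
        adjacency.setdefault(head, []).append(tail)
        adjacency.setdefault(tail, []).append(head)

    # Breadth-first search, recording each node's hop distance from the seed.
    depth = {seed: 0}
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        if depth[node] == k:
            continue  # do not expand beyond k hops
        for neighbor in adjacency.get(node, []):
            if neighbor not in depth:
                depth[neighbor] = depth[node] + 1
                queue.append(neighbor)

    # Keep only the edges whose endpoints were both reached (induced subgraph).
    nodes = set(depth)
    sub_edges = [(h, r, t) for h, r, t in edges if h in nodes and t in nodes]
    return nodes, sub_edges


# Toy chain A - B - C - D with made-up relation labels.
toy_edges = [("A", "r1", "B"), ("B", "r2", "C"), ("C", "r3", "D")]
nodes, sub_edges = k_hop_subgraph(toy_edges, "A", 2)
print(sorted(nodes))
print(sub_edges)
```

With `k = 2` and seed `A`, the sketch keeps `A`, `B`, and `C` and drops the edge to `D`; the retained triples would then serve as the context handed to a generator.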
Abstract
Fine-tuning large language models (LLMs) typically requires substantial amounts of high-quality supervised data, which is both costly and labor-intensive to acquire. While synthetic data generation has emerged as a promising solution, existing approaches frequently suffer from factual inaccuracies, insufficient long-tail coverage, simplistic knowledge structures, and homogenized outputs. To address these challenges, we introduce GraphGen, a knowledge graph-guided framework designed for three key question-answering (QA) scenarios: atomic QA, aggregated QA, and multi-hop QA. It begins by constructing a fine-grained knowledge graph from the source text. It then identifies knowledge gaps in LLMs using the expected calibration error (ECE) metric, prioritizing the generation of QA pairs that target high-value, long-tail knowledge. Furthermore, GraphGen incorporates multi-hop neighborhood sampling to capture complex relational information and employs style-controlled generation to diversify the resulting QA data. Experimental results on knowledge-intensive tasks under closed-book settings demonstrate that GraphGen outperforms conventional synthetic data methods, offering a more reliable and comprehensive solution to the data scarcity challenge in supervised fine-tuning. The code and data are publicly available at https://github.com/open-sciencelab/GraphGen.
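For readers unfamiliar with the calibration metric mentioned above, the sketch below computes the standard binned form of the expected calibration error, $\mathrm{ECE} = \sum_b \frac{n_b}{N}\,\lvert \mathrm{acc}_b - \mathrm{conf}_b \rvert$, over a set of model judgments. It shows the general metric only; how GraphGen elicits confidences from the LLM and scores individual knowledge-graph statements is not specified here, so the function name, the equal-width binning, and the example inputs are assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard binned ECE: weighted mean of |accuracy - confidence| per bin.

    `confidences` are probabilities in [0, 1]; `correct` are booleans.
    Equal-width binning is an assumption made for this illustration.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        return 0.0

    # Assign each prediction to an equal-width confidence bin.
    bins = [[] for _ in range(n_bins)]
    for confidence, is_correct in zip(confidences, correct):
        index = min(int(confidence * n_bins), n_bins - 1)  # 1.0 goes in last bin
        bins[index].append((confidence, is_correct))

    # Sum the per-bin gap between accuracy and mean confidence,
    # weighted by the fraction of predictions in that bin.
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_confidence = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_confidence)
    return ece


# Hypothetical example: the model is 90% confident on four statements
# but only three of its judgments are right.
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
print(round(ece, 3))
```

In this example the single occupied bin has mean confidence 0.9 and accuracy 0.75, so the gap is about 0.15. A larger gap signals knowledge the model is miscalibrated on, which is the kind of signal the abstract describes using to prioritize QA generation.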
