GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven Synthetic Data Generation

Zihong Chen, Wanli Jiang, Jinzhe Li, Zhonghang Yuan, Huanjun Kong, Wanli Ouyang, Nanqing Dong

TL;DR

GraphGen is a knowledge-graph-guided data synthesis framework that aims to ease the data bottleneck in supervised fine-tuning (SFT) of LLMs for knowledge-intensive tasks. It constructs a fine-grained knowledge graph (KG) from source text, identifies the model's knowledge blind spots with expected calibration error (ECE), samples $k$-hop subgraphs to provide coherent context, and applies style-controlled generation to produce diverse data for atomic, aggregated, and multi-hop QA. Under closed-book QA, the approach consistently outperforms five baselines on three domains, with stronger lexical diversity, coherence, and long-tail coverage, and it remains robust across model architectures and scales. The result is a practical, scalable pipeline for improving SFT data quality that highlights the value of explicit knowledge structure in synthetic data generation.

Abstract

Fine-tuning for large language models (LLMs) typically requires substantial amounts of high-quality supervised data, which is both costly and labor-intensive to acquire. While synthetic data generation has emerged as a promising solution, existing approaches frequently suffer from factual inaccuracies, insufficient long-tail coverage, simplistic knowledge structures, and homogenized outputs. To address these challenges, we introduce GraphGen, a knowledge graph-guided framework designed for three key question-answering (QA) scenarios: atomic QA, aggregated QA, and multi-hop QA. It begins by constructing a fine-grained knowledge graph from the source text. It then identifies knowledge gaps in LLMs using the expected calibration error metric, prioritizing the generation of QA pairs that target high-value, long-tail knowledge. Furthermore, GraphGen incorporates multi-hop neighborhood sampling to capture complex relational information and employs style-controlled generation to diversify the resulting QA data. Experimental results on knowledge-intensive tasks under closed-book settings demonstrate that GraphGen outperforms conventional synthetic data methods, offering a more reliable and comprehensive solution to the data scarcity challenge in supervised fine-tuning. The code and data are publicly available at https://github.com/open-sciencelab/GraphGen.
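The abstract names expected calibration error as the signal for finding knowledge gaps. As a rough illustration only, the sketch below computes the textbook binned form of ECE: predictions are grouped by confidence, and the gap between average confidence and accuracy in each bin is weighted by bin size. The function name, the equal-width binning, and the default of 10 bins are assumptions made for this example; the paper's own formulation (it refers to a "comprehension loss") may differ.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,  # assumption: equal-width bins over [0, 1]
) -> float:
    """Textbook binned ECE: sum over bins of (bin share) * |accuracy - mean confidence|."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        return 0.0

    # Assign each prediction to a bin by its confidence; 1.0 falls in the last bin.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece
```

A value near 0 means the model's stated confidence tracks how often it is right; larger values flag statements where confidence and correctness diverge, which is the kind of knowledge point a gap-targeting pipeline would want to prioritize.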

Paper Structure

This paper contains 47 sections, 7 equations, 22 figures, 12 tables, 1 algorithm.

Figures (22)

  • Figure 1: Pipeline of GraphGen. GraphGen optimizes LLM's performance by effectively organizing knowledge and identifying the specific data required for training the model. It comprises four core stages: Step 1 (a): Initially, entities/relationships are extracted to build a KG. Step 2 (b): Then, the Trainee Model’s understanding of knowledge points is evaluated by judging the correctness of given statements and calculating the comprehension loss accordingly. Step 3 (c): Then, subgraphs are formed for efficient training. The composition of these subgraphs is controlled using various traversal strategies. Step 4 (d): Finally, subgraphs are converted into QA pairs for the three scenarios: atomic QA, aggregated QA and multi-hop QA (see Section \ref{sec:graphgen} for details).
  • Figure 2: Prompt for comprehension assessment. Through binary yes/no questions, we capture precise semantic information for confidence modeling.
  • Figure 3: Performance comparison on knowledge-intensive evaluation datasets. We use data generated through various methods to optimize Qwen2.5-7B-Instruct. We use ROUGE-F as the metric. The baseline methods exhibit varying performance across the three datasets, while GraphGen consistently achieves optimal results.
  • Figure 4: Distribution of comprehension loss for the Trainee Model. The model's comprehension loss is relatively low for the vast majority of data, which indicates most of the data generated by the Synthesizer Model has already been mastered by the Trainee Model.
  • Figure 5: Performance comparison conducted with varying proportions of training data. The proportions are arranged in descending order based on loss. "Average" represents the mean score across three datasets. As the amount of training data increases, we observe a noticeable and consistent upward trend in the results.
  • ...and 17 more figures
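
Step 3 in the Figure 1 caption forms subgraphs from the KG using traversal strategies, and the TL;DR describes these as $k$-hop subgraphs. The sketch below shows one simple way such a neighborhood could be collected: a breadth-first search from a seed entity that stops after k hops. The adjacency-list representation, the function name, and the toy graph are all hypothetical, and plain BFS stands in for the traversal strategies of the paper, which are not specified in this excerpt.

```python
from collections import deque
from typing import Dict, List, Set, Tuple


def k_hop_subgraph(
    adjacency: Dict[str, List[str]], seed: str, k: int
) -> Tuple[Set[str], Set[Tuple[str, str]]]:
    """Return the nodes within k hops of seed and the undirected edges among them."""
    if seed not in adjacency:
        raise KeyError(f"unknown seed entity: {seed}")

    depth = {seed: 0}  # hop distance from the seed for every visited node
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        if depth[node] == k:
            continue  # do not expand beyond the k-hop frontier
        for neighbor in adjacency.get(node, []):
            if neighbor not in depth:
                depth[neighbor] = depth[node] + 1
                queue.append(neighbor)

    nodes = set(depth)
    # Keep only edges with both endpoints inside the neighborhood,
    # stored as sorted pairs so each undirected edge appears once.
    edges = {
        tuple(sorted((u, v)))
        for u in nodes
        for v in adjacency.get(u, [])
        if v in nodes
    }
    return nodes, edges


# Hypothetical toy KG: a chain A-B-C-D with an extra branch B-E.
toy_graph = {
    "A": ["B"],
    "B": ["A", "C", "E"],
    "C": ["B", "D"],
    "D": ["C"],
    "E": ["B"],
}
two_hop_nodes, two_hop_edges = k_hop_subgraph(toy_graph, "A", 2)
```

With the toy graph, a 2-hop neighborhood around "A" reaches "B", "C", and "E" but not "D", so the extracted context stays local to the seed entity. Larger k gives richer multi-hop context at the cost of longer prompts.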