Ontology-Free General-Domain Knowledge Graph-to-Text Generation Dataset Synthesis using Large Language Model

Daehee Kim, Deokhyung Kang, Sangwon Ryu, Gary Geunbae Lee

TL;DR

This work tackles the scarcity of large-scale, general-domain Graph-to-Text data with precise graph-text alignment by introducing WikiOFGraph, a 5.85 million-sample ontology-free dataset. The dataset is generated by an LLM-based graph extraction pipeline guided by in-context examples and filtered with Data-QuestEval. The approach yields high graph-text consistency while covering a broad range of domains, addressing limitations of prior ontology-based datasets. A Transformer-based generator fine-tuned on WikiOFGraph consistently outperforms models trained on WebNLG, GenWiki, TekGen, and LAGRANGE on GenWiki and WikiOFGraph-derived test sets, demonstrating strong generalization to diverse domains. The work emphasizes reproducibility and scalability, finds Data-QuestEval effective for quality control, and discusses multilingual extension and contamination concerns as avenues for further improvement.

Abstract

Knowledge Graph-to-Text (G2T) generation involves verbalizing structured knowledge graphs into natural language text. Recent advancements in Pretrained Language Models (PLMs) have improved G2T performance, but their effectiveness depends on datasets with precise graph-text alignment. However, the scarcity of high-quality, general-domain G2T datasets restricts progress in general-domain G2T research. To address this issue, we introduce the Wikipedia Ontology-Free Graph-text dataset (WikiOFGraph), a new large-scale G2T dataset generated using a novel method that leverages a Large Language Model (LLM) and Data-QuestEval. Our new dataset, which contains 5.85M general-domain graph-text pairs, offers high graph-text consistency without relying on external ontologies. Experimental results demonstrate that a PLM fine-tuned on WikiOFGraph outperforms PLMs trained on other datasets across various evaluation metrics. Our method proves to be a scalable and effective solution for generating high-quality G2T data, significantly advancing the field of G2T generation.

Paper Structure (39 sections, 9 figures, 10 tables)

This paper contains 39 sections, 9 figures, 10 tables.

Figures (9)

  • Figure 1: An example of a Graph-Text pair from existing ontology-based datasets. Although (a) and (b) are paired, the graph in (a) does not contain the information of the underlined text in (b), illustrating a common misalignment problem.
  • Figure 2: Method for constructing WikiOFGraph. Source sentences are collected from Wikipedia. Graph representations are then extracted using an LLM through in-context learning, guided by manually selected examples from WebNLG. Data-QuestEval filtering then curates the graph-text pairs that are compiled into WikiOFGraph.
  • Figure 3: Normalized distribution of the number of triplets in each dataset.
  • Figure 4: Normalized distribution of the number of words in each dataset.
  • Figure 5: Average number of words per number of triplets across different datasets.
  • ...and 4 more figures
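
The construction pipeline summarized in the Figure 2 caption (few-shot graph extraction followed by consistency filtering) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the prompt layout, the names `GraphTextPair`, `build_prompt`, `toy_score`, and `filter_pairs`, and the threshold value are all hypothetical, and `toy_score` is a crude string-matching stand-in for Data-QuestEval, which is not reproduced here.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# A triplet is (subject, predicate, object); predicates are free-form strings,
# i.e. not drawn from a fixed ontology.
Triplet = Tuple[str, str, str]


@dataclass
class GraphTextPair:
    text: str
    triplets: List[Triplet]


def build_prompt(sentence: str, examples: List[GraphTextPair]) -> str:
    """Assemble a few-shot prompt from example pairs.

    The "Text:/Graph:" layout is an assumption made for illustration only.
    """
    parts = []
    for ex in examples:
        graph = " | ".join(f"({s}, {p}, {o})" for s, p, o in ex.triplets)
        parts.append(f"Text: {ex.text}\nGraph: {graph}")
    # The final block is left open so the LLM completes the graph.
    parts.append(f"Text: {sentence}\nGraph:")
    return "\n\n".join(parts)


def toy_score(pair: GraphTextPair) -> float:
    """Hypothetical stand-in for a consistency metric such as Data-QuestEval.

    Returns the fraction of subject/object strings that appear in the text.
    """
    entities = [e for s, _, o in pair.triplets for e in (s, o)]
    if not entities:
        return 0.0
    text = pair.text.lower()
    return sum(e.lower() in text for e in entities) / len(entities)


def filter_pairs(
    pairs: List[GraphTextPair],
    score_fn: Callable[[GraphTextPair], float],
    threshold: float,
) -> List[GraphTextPair]:
    """Keep only pairs whose consistency score reaches the threshold."""
    return [p for p in pairs if score_fn(p) >= threshold]


if __name__ == "__main__":
    shots = [
        GraphTextPair(
            "Alan Turing was born in London.",
            [("Alan Turing", "birthPlace", "London")],
        )
    ]
    print(build_prompt("Marie Curie won the Nobel Prize.", shots))

    aligned = GraphTextPair(
        "Marie Curie won the Nobel Prize.",
        [("Marie Curie", "award", "Nobel Prize")],
    )
    misaligned = GraphTextPair(
        "Marie Curie won the Nobel Prize.",
        [("Marie Curie", "birthPlace", "Warsaw")],  # object not in the text
    )
    kept = filter_pairs([aligned, misaligned], toy_score, threshold=0.75)
    print(len(kept))
```

In this sketch the misaligned pair is dropped because its graph mentions an entity that the sentence does not, which is the kind of graph-text mismatch the filtering step is meant to catch. A real pipeline would replace `toy_score` with an actual consistency metric and send the prompt to an LLM.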