Extract, Define, Canonicalize: An LLM-based Framework for Knowledge Graph Construction

Bowen Zhang, Harold Soh

TL;DR

This work introduces Extract-Define-Canonicalize (EDC), a three-phase framework for knowledge graph construction that uses open information extraction, schema definition, and post-hoc canonicalization to build high-quality KGs without having to fit a large pre-defined schema into the LLM prompt. A refinement extension, EDC+R, adds a trained Schema Retriever that retrieves schema elements relevant to the input text, improving extraction performance in a retrieval-augmented generation style. On WebNLG, REBEL, and Wiki-NRE, EDC outperforms state-of-the-art baselines, and EDC+R provides further gains, with robust performance under both the Target Alignment and Self Canonicalization settings. The work highlights the framework's scalability to large schemas and its applicability when no fixed schema is available, which matters for real-world KGC tasks and downstream applications such as reasoning and question answering.

Abstract

In this work, we are interested in automated methods for knowledge graph creation (KGC) from input text. Progress on large language models (LLMs) has prompted a series of recent works applying them to KGC, e.g., via zero/few-shot prompting. Despite successes on small domain-specific datasets, these models face difficulties scaling up to text common in many real-world applications. A principal issue is that, in prior methods, the KG schema has to be included in the LLM prompt to generate valid triplets; larger and more complex schemas easily exceed the LLMs' context window length. Furthermore, there are scenarios where a fixed pre-defined schema is not available and we would like the method to construct a high-quality KG with a succinct self-generated schema. To address these problems, we propose a three-phase framework named Extract-Define-Canonicalize (EDC): open information extraction followed by schema definition and post-hoc canonicalization. EDC is flexible in that it can be applied to settings where a pre-defined target schema is available and when it is not; in the latter case, it constructs a schema automatically and applies self-canonicalization. To further improve performance, we introduce a trained component that retrieves schema elements relevant to the input text; this improves the LLMs' extraction performance in a retrieval-augmented generation-like manner. We demonstrate on three KGC benchmarks that EDC is able to extract high-quality triplets without any parameter tuning and with significantly larger schemas compared to prior works. Code for EDC is available at https://github.com/clear-nus/edc.
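To make the three phases concrete, the following is a minimal sketch of the control flow only, not the authors' implementation (see the linked repository for that). It makes several assumptions: the LLM calls for extraction and definition are replaced by hypothetical toy stubs (`toy_extract`, `toy_define`), a token-overlap (Jaccard) score stands in for whatever similarity measure the real system uses, and the `threshold` and `allow_new` parameters are illustrative names. The retrieval-based refinement step is not shown.

```python
from typing import Callable, Dict, List, Tuple

Triplet = Tuple[str, str, str]


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; an assumed stand-in for a learned similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def canonicalize(
    triplets: List[Triplet],
    definitions: Dict[str, str],
    schema: Dict[str, str],
    threshold: float = 0.5,
    allow_new: bool = False,
) -> List[Triplet]:
    """Map each open relation to the closest schema relation by definition.

    With allow_new=False (a fixed target schema), unmatched triplets are
    dropped. With allow_new=True (self-canonicalization), an unmatched
    relation and its definition are added to the schema instead.
    """
    result: List[Triplet] = []
    for subj, rel, obj in triplets:
        rel_def = definitions[rel]
        best = max(schema, key=lambda r: jaccard(rel_def, schema[r]), default=None)
        if best is not None and jaccard(rel_def, schema[best]) >= threshold:
            result.append((subj, best, obj))
        elif allow_new:
            schema[rel] = rel_def  # grow the self-generated schema
            result.append((subj, rel, obj))
    return result


def edc(
    text: str,
    extract: Callable[[str], List[Triplet]],
    define: Callable[[str, List[Triplet]], Dict[str, str]],
    schema: Dict[str, str],
    allow_new: bool = False,
) -> List[Triplet]:
    """Run extract -> define -> canonicalize on one input text."""
    triplets = extract(text)             # phase 1: open information extraction
    definitions = define(text, triplets)  # phase 2: schema definition
    return canonicalize(triplets, definitions, schema, allow_new=allow_new)


# Hypothetical stubs standing in for LLM prompts.
def toy_extract(text: str) -> List[Triplet]:
    return [("Alan Turing", "was born in", "London")]


def toy_define(text: str, triplets: List[Triplet]) -> Dict[str, str]:
    return {"was born in": "the place where the subject was born"}


if __name__ == "__main__":
    target = {"birthPlace": "the place where the subject person was born"}
    print(edc("Alan Turing was born in London.", toy_extract, toy_define, target))
```

Passing a target schema exercises the case where a pre-defined schema is available; passing an empty dictionary with `allow_new=True` exercises the self-canonicalization case, where the schema is built up as relations are encountered. Because the schema is only consulted after extraction, it never has to fit into the extraction prompt, which is the scaling argument the abstract makes.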

Paper Structure

This paper contains 41 sections, 1 equation, 3 figures, 10 tables.

Figures (3)

  • Figure 1: A high-level illustration of Extract-Define-Canonicalize (EDC) for Knowledge Graph Construction.
  • Figure 2: Performance of EDC and EDC+R on WebNLG, REBEL, and Wiki-NRE datasets against baselines in the Target Alignment setting (F1 scores with 'Partial' criteria). EDC+R only performs one iteration of refinement due to diminishing marginal improvement.
  • Figure 3: An example screenshot of the questionnaire including the instructions given to the annotators.