InstructIE: A Bilingual Instruction-based Information Extraction Dataset

Honghao Gui, Shuofei Qiao, Jintian Zhang, Hongbin Ye, Mengshu Sun, Lei Liang, Jeff Z. Pan, Huajun Chen, Ningyu Zhang

TL;DR

This work addresses the scarcity of comprehensive instruction-based information extraction (IE) data by introducing InstructIE, a bilingual Chinese–English IE dataset spanning 12 domains and 123 relation types. It presents KG2Instruction, a framework that generates IE instruction data automatically by aligning knowledge graphs with text, supplementing missing triples with an IE model, and filtering unreliable triples with natural language inference (NLI). The result is 364,074 instances plus a 2,000-sample test set. Experiments show that LLMs instruction-tuned on InstructIE achieve substantial gains in zero-shot, in-context, and fine-tuned settings, with notable improvements in generalization to unseen schemas. The work demonstrates that automatic, domain-spanning IE data generation is feasible and valuable for KG construction and downstream tasks, while acknowledging limitations in language coverage, domain coverage, and label noise, and outlining directions for future expansion and refinement.
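The generation steps named above (alignment, supplementing missing triples, NLI filtering) can be illustrated with a minimal Python sketch. This is not the authors' KG2Instruction code: the function names, the output field names, the toy sentence, and the toy knowledge graph are all hypothetical. Plain substring matching stands in for entity linking and disambiguation, and a cue-phrase rule stands in for a real NLI model.

```python
from typing import Callable, Iterable

Triple = tuple[str, str, str]  # (head, relation, tail)


def align_kg_with_text(text: str, kg: Iterable[Triple], schema: set[str]) -> list[Triple]:
    """Keep KG triples whose relation is in the schema and whose head and tail
    both occur in the text (a crude stand-in for entity linking)."""
    return [(h, r, t) for (h, r, t) in kg if r in schema and h in text and t in text]


def build_instance(
    text: str,
    kg: Iterable[Triple],
    schema: set[str],
    ie_model: Callable[[str, set[str]], list[Triple]],
    nli_entails: Callable[[str, Triple], bool],
) -> dict:
    # Step 1: align existing KG triples with the text.
    aligned = align_kg_with_text(text, kg, schema)
    # Step 2: supplement triples the KG is missing, using an IE model.
    extra = [t for t in ie_model(text, schema) if t[1] in schema and t not in aligned]
    # Step 3: drop triples the text does not actually support.
    kept = [t for t in aligned + extra if nli_entails(text, t)]
    return {"input": text, "schema": sorted(schema), "triples": kept}


# Hypothetical stand-ins for the IE model and the NLI model.
CUES = {"place of birth": "born in", "place of death": "died in", "work location": "worked in"}


def toy_ie_model(text: str, schema: set[str]) -> list[Triple]:
    return [("Alice Chen", "work location", "Oslo")]


def toy_nli_entails(text: str, triple: Triple) -> bool:
    _, relation, tail = triple
    cue = CUES.get(relation)
    return cue is not None and f"{cue} {tail}" in text


text = "Alice Chen was born in Lyon and later worked in Oslo."
kg = [
    ("Alice Chen", "place of birth", "Lyon"),
    ("Alice Chen", "place of death", "Oslo"),  # both entities occur, but the text does not say this
    ("Alice Chen", "spouse", "Bob Lee"),       # tail entity is absent from the text
]
schema = {"place of birth", "place of death", "work location", "spouse"}

instance = build_instance(text, kg, schema, toy_ie_model, toy_nli_entails)
print(instance["triples"])
```

In this toy run the alignment step admits a noisy "place of death" triple because both entities appear in the sentence. The filtering step is what removes it, which is the role the summary attributes to NLI.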

Abstract

Large language models can perform well on general natural language tasks, but their effectiveness is still suboptimal for information extraction (IE). Recent work indicates that the main reason lies in the lack of extensive data on IE instructions. Existing IE instruction datasets not only have limited coverage but are also costly to construct. To address this issue, we introduce InstructIE, a bilingual instruction-based IE dataset, which covers 12 diverse domains. We propose KG2Instruction, a framework specifically for the automatic generation of such datasets. Additionally, we manually annotate the test set. Experimental results demonstrate that large language models trained with InstructIE not only obtain better IE capabilities but also improve zero-shot performance compared with baselines.

Paper Structure

This paper contains 35 sections, 7 figures, and 6 tables.

Figures (7)

  • Figure 1: Comparison of traditional information extraction (IE) approaches with instruction-based IE in handling emergent classes (unseen during training). Dashed lines represent the addition of a new class (e.g., post). Traditional approaches often struggle to accommodate users' evolving extraction requirements. In contrast, instruction-based IE can comprehend instructions, discern changes in requirements, and effectively extract newly added classes.
  • Figure 2: Examples of instructions and their outputs for knowledge graph construction, with the Schema Repository containing labels under various domains.
  • Figure 3: Overview of InstructIE dataset construction. (a) Identify Entity Mentions. (b) Disambiguation. (c) Schema Constraint Matching. (d) Missing Triplets Supplement with LLM. (e) Hallucinatory Triplets Filtering with NLI.
  • Figure 4: Classification of 14 entity types, aiming at covering a diverse array of entities with distinct boundaries.
  • Figure 5: (a) The results of Baichuan2-13B-Chat (LoRA tuning) on the InstructIE-ZH subset, (b) The results of LLaMA2-13B-Chat on the InstructIE-EN subset. The label w/o LLMs denotes the removal of the step "Missing Triplets Supplement with LLM", w/o NLI indicates the removal of the step "Hallucinatory Triplets Filtering with NLI", and w/o NLI and LLMs signifies the removal of both steps.
  • ...and 2 more figures
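
The idea behind Figures 1 and 2, that a new class is accommodated by changing the instruction rather than the model, can be sketched in Python as follows. The field names (`instruction`, `schema`, `input`), the instruction wording, and the example sentence are assumptions made for illustration and are not confirmed by the source.

```python
import json


def build_instruction(text: str, schema: list[str]) -> str:
    """Serialize an extraction request; the schema is part of the prompt itself."""
    payload = {
        "instruction": (
            "Extract every (head, relation, tail) triple from the input "
            "whose relation is listed in the schema."
        ),
        "schema": schema,
        "input": text,
    }
    # ensure_ascii=False keeps Chinese text readable in a bilingual setting.
    return json.dumps(payload, ensure_ascii=False)


text = "Alice Chen was born in Lyon and serves as director of the Oslo Institute."

base_schema = ["place of birth", "employer"]
extended_schema = base_schema + ["post"]  # the emergent class from Figure 1

prompt_before = build_instruction(text, base_schema)
prompt_after = build_instruction(text, extended_schema)
print(prompt_after)
```

Only the `schema` list differs between the two prompts. A traditional extractor with a fixed label set would instead need retraining to cover the added class.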