IEPile: Unearthing Large-Scale Schema-Based Information Extraction Corpus

Honghao Gui, Lin Yuan, Hongbin Ye, Ningyu Zhang, Mengshu Sun, Lei Liang, Huajun Chen

TL;DR

IEPile addresses the limited, non-standardized data hindering large-language-model-based IE by creating a comprehensive bilingual instruction corpus (~0.32B tokens) through cleaning 33 IE datasets and introducing schema-based instruction generation. The method includes positive/negative schema signaling, hard negative schema construction, and batched schema querying to improve generalization and reduce train–eval gaps. Experiments show zero-shot IE gains for several models trained on IEPile, with OneKE achieving strong supervised performance via full fine-tuning and LLMs like LLaMA2-IEPile delivering notable English/Chinese IE capabilities. The work demonstrates a scalable framework for building domain-specific IE datasets and highlights insights on schema ambiguity and query-count consistency, offering practical tools for NLP researchers and practitioners. The resource is open-sourced to support broader IE research and deployment efforts, enabling improved cross-domain information extraction with schema-guided prompts.

Abstract

Large Language Models (LLMs) demonstrate remarkable potential across various domains; however, they exhibit a significant performance gap in Information Extraction (IE). High-quality instruction data is key to enhancing the specific capabilities of LLMs, yet current IE datasets tend to be small in scale, fragmented, and lacking a standardized schema. To this end, we introduce IEPile, a comprehensive bilingual (English and Chinese) IE instruction corpus, which contains approximately 0.32B tokens. We construct IEPile by collecting and cleaning 33 existing IE datasets, and introduce schema-based instruction generation to unearth a large-scale corpus. Experimentally, IEPile enhances the performance of LLMs for IE, with notable improvements in zero-shot generalization. We open-source the resource and pre-trained models, hoping to provide valuable support to the NLP community.

Paper Structure

This paper contains 38 sections, 4 figures, 12 tables, 1 algorithm.

Figures (4)

  • Figure 1: An overview of the construction of IEPile, including Data Collection and Cleaning, as well as Schema-Based Instruction Generation (Hard Negative Schema Construction and Batched Instruction Generation).
  • Figure 2: Distribution of different tasks, domains, and source datasets within the IEPile.
  • Figure 3: (a) When the number of schemas queried per instruction differs between training and evaluation, model performance decreases significantly. (b) The impact of removing the hard negative schema dictionary on model performance.
  • Figure 4: An exemplar of data records for OntoNotes: the domain, the number and details of schemas, the total volume of data, the $split\_num$, the number of instructions produced using our method, along with the distribution of split count within the interval [($split\_num$ / 2), ($split\_num$ + $split\_num$ / 2)].
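
The captions for Figures 1 and 4 refer to hard negative schema construction and to splitting the schemas of one text into batches governed by $split\_num$. The Python sketch below illustrates that idea under stated assumptions and is not the authors' implementation. The function name, the `hard_negative_map` argument, the instruction wording, the JSON field names, and the choice to sample at most `split_num` ordinary negatives are all hypothetical.

```python
import json
import random


def build_batched_instructions(text, positive, hard_negative_map, all_schemas,
                               split_num, seed=0):
    """Return one JSON instruction per batch of at most split_num schemas.

    Hypothetical helper for illustration only; names and fields are assumed.
    """
    rng = random.Random(seed)
    positive = set(positive)

    # Hard negatives: schemas recorded as easily confused with a positive
    # schema, but not annotated in this text.
    hard = {neg for pos in positive for neg in hard_negative_map.get(pos, [])}
    hard -= positive

    # Ordinary negatives: sample a few of the remaining schemas
    # (assumption: at most split_num of them) to keep the query small.
    others = sorted(set(all_schemas) - positive - hard)
    sampled = rng.sample(others, k=min(split_num, len(others)))

    # Mix everything so positives and negatives are spread across batches.
    queried = sorted(positive) + sorted(hard) + sampled
    rng.shuffle(queried)

    # Batched querying: each instruction asks about at most split_num schemas.
    batches = [queried[i:i + split_num]
               for i in range(0, len(queried), split_num)]
    return [
        json.dumps(
            {"instruction": "Extract entities of the listed types from the input.",
             "schema": batch,
             "input": text},
            ensure_ascii=False,
        )
        for batch in batches
    ]
```

Using the same `split_num` when building evaluation prompts keeps the number of schemas per query consistent with training, which is the condition Figure 3(a) examines.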