IEPile: Unearthing Large-Scale Schema-Based Information Extraction Corpus

Honghao Gui; Lin Yuan; Hongbin Ye; Ningyu Zhang; Mengshu Sun; Lei Liang; Huajun Chen

TL;DR

IEPile addresses the small, fragmented, and non-standardized data that limits LLM-based information extraction (IE). It is a bilingual (English and Chinese) instruction corpus of roughly 0.32B tokens, built by cleaning 33 existing IE datasets and applying schema-based instruction generation. The generation method combines positive/negative schema signaling, hard negative schema construction, and batched schema querying to improve generalization and reduce the gap between training and evaluation. In experiments, several models trained on IEPile improve in zero-shot IE; OneKE reaches strong supervised performance via full fine-tuning, and LLaMA2-IEPile performs well on both English and Chinese IE. The analysis also highlights two pitfalls: schema ambiguity and inconsistent schema-query counts between training and evaluation. The same pipeline offers a scalable framework for building domain-specific IE datasets, and the resource is open-sourced to support broader IE research and deployment.

Abstract

Large Language Models (LLMs) demonstrate remarkable potential across various domains; however, they exhibit a significant performance gap in Information Extraction (IE). High-quality instruction data is key to enhancing the specific capabilities of LLMs, yet current IE datasets tend to be small in scale, fragmented, and lacking a standardized schema. To this end, we introduce IEPile, a comprehensive bilingual (English and Chinese) IE instruction corpus, which contains approximately 0.32B tokens. We construct IEPile by collecting and cleaning 33 existing IE datasets, and introduce schema-based instruction generation to unearth a large-scale corpus. Experimentally, IEPile enhances the performance of LLMs for IE, with notable improvements in zero-shot generalization. We open-source the resource and pre-trained models, hoping to provide valuable support to the NLP community.
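To make the schema-based instruction generation summarized above more concrete, here is a minimal Python sketch of how positive schemas, hard negatives, and batched schema querying could fit together. It is an illustration under stated assumptions, not the authors' implementation: the function names, the JSON field names (`instruction`, `schema`, `input`), the instruction wording, the number of random negatives, and the rule that merges a short final batch into the previous one are all hypothetical. The merge rule is chosen only so that batch sizes fall roughly in the interval [split_num / 2, split_num + split_num / 2] mentioned in the Figure 4 caption.

```python
import json
import random


def build_schema_pool(positive, hard_negative_dict, all_schemas, n_random_neg, rng):
    """Assumed pooling rule: positives + their hard negatives + a few random negatives."""
    pos = list(dict.fromkeys(positive))  # de-duplicate, keep order
    hard = []
    for schema in pos:
        # Hard negatives: schemas listed as semantically close to a positive one.
        for candidate in hard_negative_dict.get(schema, []):
            if candidate not in pos and candidate not in hard:
                hard.append(candidate)
    rest = [s for s in all_schemas if s not in pos and s not in hard]
    rand = rng.sample(rest, min(n_random_neg, len(rest)))
    pool = pos + hard + rand
    rng.shuffle(pool)  # avoid a fixed positive-first ordering
    return pool


def split_schemas(pool, split_num):
    """Chunk the pool into batches of split_num schemas.

    Assumption: a final batch shorter than split_num / 2 is merged into the
    previous batch, so sizes stay within [split_num/2, split_num + split_num/2)
    whenever the pool holds at least split_num / 2 schemas.
    """
    batches = [pool[i:i + split_num] for i in range(0, len(pool), split_num)]
    if len(batches) > 1 and len(batches[-1]) < split_num / 2:
        tail = batches.pop()
        batches[-1].extend(tail)
    return batches


def build_instructions(text, annotations, pool, split_num, task="NER"):
    """Emit one instruction record per schema batch."""
    records = []
    for batch in split_schemas(pool, split_num):
        prompt = {
            "instruction": f"Extract the {task} items defined in the schema from the input.",
            "schema": batch,
            "input": text,
        }
        # Negative schemas (absent from the annotations) map to an empty list.
        output = {schema: annotations.get(schema, []) for schema in batch}
        records.append({
            "prompt": json.dumps(prompt, ensure_ascii=False),
            "output": json.dumps(output, ensure_ascii=False),
        })
    return records


# Hypothetical toy data, for illustration only.
rng = random.Random(0)
all_schemas = ["person", "organization", "location", "company",
               "city", "date", "event", "product"]
hard_negatives = {"organization": ["company"], "location": ["city"]}
annotations = {"organization": ["Zhejiang University"], "location": ["Hangzhou"]}

pool = build_schema_pool(list(annotations), hard_negatives, all_schemas, 3, rng)
records = build_instructions("Zhejiang University is located in Hangzhou.",
                             annotations, pool, split_num=3)
```

With 7 schemas in the pool and `split_num=3`, the chunks would be 3, 3, and 1; the single leftover schema is merged, giving two instructions with 3 and 4 schemas. Keeping the number of schemas per instruction in a narrow band is the design point here: the analysis listed in this paper's outline reports that a mismatch in schema-query counts between training and evaluation hurts generalization.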

Paper Structure (38 sections, 4 figures, 12 tables, 1 algorithm)

Introduction
IEPile
Data Collection and Cleaning
Schema-Based Instruction Generation
Positive and Negative Schema Mechanism in Instructions.
Hard Negative Schema Construction.
Batched Instruction Generation.
Data Statistics
Experiments
Experimental Settings
Main Results
Analysis
Inconsistency in the Number of Schema Queries Hurts Generalization.
Inadequate Differentiation Among Schemas Leads to Confusion Among Semantically Similar Schemas.
Conclusion and Future Work
...and 23 more sections

Figures (4)

Figure 1: An overview of the construction of IEPile, including Data Collection and Cleaning, as well as Schema-Based Instruction Generation (Hard Negative Schema Construction and Batched Instruction Generation).
Figure 2: Distribution of different tasks, domains, and source datasets within IEPile.
Figure 3: (a) Model performance drops significantly when the number of schema queries differs between training and evaluation. (b) The impact of removing the hard negative schema dictionary on model performance.
Figure 4: An exemplar of data records for OntoNotes: the domain, the number and details of schemas, the total volume of data, the $split\_num$, the number of instructions produced using our method, along with the distribution of split count within the interval [($split\_num$ / 2), ($split\_num$ + $split\_num$ / 2)].
