IEPile: Unearthing Large-Scale Schema-Based Information Extraction Corpus
Honghao Gui, Lin Yuan, Hongbin Ye, Ningyu Zhang, Mengshu Sun, Lei Liang, Huajun Chen
TL;DR
IEPile targets the small, fragmented, and non-standardized data that limits LLM-based Information Extraction (IE). It is a bilingual (English and Chinese) instruction corpus of approximately 0.32B tokens, built by collecting and cleaning 33 existing IE datasets and applying schema-based instruction generation. The generation method combines positive/negative schema signaling, hard negative schema construction, and batched schema querying to improve generalization and narrow the gap between training and evaluation. Experiments show zero-shot IE gains for several models trained on IEPile: OneKE achieves strong supervised performance via full fine-tuning, and models such as LLaMA2-IEPile show notable English and Chinese IE capabilities. The work also demonstrates a scalable framework for building domain-specific IE datasets and reports insights on schema ambiguity and query-count consistency. The resource is open-sourced to support broader IE research and deployment.
Abstract
Large Language Models (LLMs) demonstrate remarkable potential across various domains; however, they exhibit a significant performance gap in Information Extraction (IE). High-quality instruction data is key to enhancing the specific capabilities of LLMs, yet current IE datasets tend to be small in scale, fragmented, and lacking in standardized schemas. To this end, we introduce IEPile, a comprehensive bilingual (English and Chinese) IE instruction corpus, which contains approximately 0.32B tokens. We construct IEPile by collecting and cleaning 33 existing IE datasets, and introduce schema-based instruction generation to unearth a large-scale corpus. Experimentally, IEPile enhances the performance of LLMs for IE, with notable improvements in zero-shot generalization. We open-source the resource and pre-trained models, hoping to provide valuable support to the NLP community.
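To make the idea of schema-based instruction generation more concrete, the following is a minimal Python sketch of how positive schemas, hard negative schemas, and batched schema querying might fit together. It is an illustration under stated assumptions, not the authors' implementation: the function name `build_schema_instructions`, the parameters `split_num` and `max_negatives`, the `hard_negatives` mapping, the instruction wording, and the record layout are all hypothetical.

```python
import random


def build_schema_instructions(text, positive_labels, all_labels,
                              hard_negatives, split_num=2,
                              max_negatives=1, seed=0):
    """Build batched schema queries for one input text (illustrative only)."""
    rng = random.Random(seed)
    # Positive schemas: labels that actually occur in this text.
    positives = sorted(set(positive_labels))
    # Hard negatives: labels assumed to be easily confused with a positive
    # label, but absent from this text (the mapping is hypothetical).
    hard = sorted({neg
                   for label in positives
                   for neg in hard_negatives.get(label, [])
                   if neg not in positives})
    # Other negatives: a small random sample of the remaining labels.
    remaining = sorted(set(all_labels) - set(positives) - set(hard))
    sampled = rng.sample(remaining, min(max_negatives, len(remaining)))
    # Mix everything, then query a fixed number of schemas per instruction.
    queried = positives + hard + sampled
    rng.shuffle(queried)
    batches = [queried[i:i + split_num]
               for i in range(0, len(queried), split_num)]
    return [{"instruction": "Extract the relations listed in the schema.",
             "schema": batch,
             "input": text}
            for batch in batches]


# Hypothetical usage with made-up relation labels.
examples = build_schema_instructions(
    text="Alice works for Acme, which is located in Paris.",
    positive_labels=["works_for", "located_in"],
    all_labels=["works_for", "located_in", "founded_by",
                "born_in", "spouse", "nationality"],
    hard_negatives={"works_for": ["founded_by"], "located_in": ["born_in"]},
)
for example in examples:
    print(example["schema"])
```

Each returned record asks about at most `split_num` schemas, so the number of schemas per query stays bounded and could be kept the same at training and evaluation time. Labels that are absent from the text appear alongside the present ones, so a model trained on such records would also have to learn when to return nothing for a schema.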
