InstructIE: A Bilingual Instruction-based Information Extraction Dataset
Honghao Gui, Shuofei Qiao, Jintian Zhang, Hongbin Ye, Mengshu Sun, Lei Liang, Jeff Z. Pan, Huajun Chen, Ningyu Zhang
TL;DR
This work tackles the limited availability of comprehensive instruction-based information extraction (IE) data by introducing InstructIE, a bilingual Chinese–English IE dataset spanning 12 domains and 123 relation types. It presents KG2Instruction, a framework that automatically generates IE instruction data in three steps: aligning knowledge graphs with text, augmenting missing triples with an IE model, and filtering unreliable triples with natural language inference. The result is 364,074 instances plus a 2,000-sample test set. Experimental results show that LLMs instruction-tuned on InstructIE achieve substantial gains in zero-shot, in-context, and fine-tuned settings, with notable improvements in generalization to unseen schemas. The work demonstrates the feasibility and value of automatic, domain-spanning IE data generation for enhancing KG construction and downstream tasks, while acknowledging language, domain, and noise limitations and outlining directions for future expansion and refinement.
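The three KG2Instruction stages named above (align, augment, filter) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the exact-string alignment, the verbalization template, the 0.5 threshold, and the toy `toy_ie_model` and `toy_nli_score` stubs are all assumptions made for illustration. In practice the stubs would be replaced by a trained IE model and an NLI model.

```python
from typing import Callable, Iterable, Set, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)


def align_kg_to_text(text: str, kg: Iterable[Triple]) -> Set[Triple]:
    """Stage 1 (assumed form): keep KG triples whose head and tail both occur in the text."""
    return {(h, r, t) for h, r, t in kg if h in text and t in text}


def verbalize(triple: Triple) -> str:
    """Turn a triple into a plain sentence to use as an NLI hypothesis (hypothetical template)."""
    head, relation, tail = triple
    return f"{head} {relation.replace('_', ' ')} {tail}."


def build_instance(
    text: str,
    kg: Iterable[Triple],
    ie_model: Callable[[str], Iterable[Triple]],
    nli_score: Callable[[str, str], float],
    threshold: float = 0.5,  # assumed cut-off, not taken from the paper
) -> dict:
    # Stage 1: align knowledge-graph triples with the text.
    candidates = align_kg_to_text(text, kg)
    # Stage 2: add triples the KG is missing, as proposed by an IE model.
    candidates |= set(ie_model(text))
    # Stage 3: keep only triples the text entails according to the NLI scorer.
    kept = sorted(
        triple for triple in candidates
        if nli_score(text, verbalize(triple)) >= threshold
    )
    return {"text": text, "triples": kept}


# Toy stand-ins so the sketch runs end to end; real models would go here.
def toy_ie_model(text: str) -> Iterable[Triple]:
    return [("Marie Curie", "work_location", "Paris")]


def toy_nli_score(premise: str, hypothesis: str) -> float:
    # Pretend the text never supports a "death" claim.
    return 0.0 if "death" in hypothesis else 1.0


if __name__ == "__main__":
    text = "Marie Curie was born in Warsaw and worked in Paris."
    kg = [
        ("Marie Curie", "place_of_birth", "Warsaw"),
        ("Marie Curie", "place_of_death", "Paris"),
        ("Albert Einstein", "place_of_birth", "Ulm"),
    ]
    print(build_instance(text, kg, toy_ie_model, toy_nli_score))
```

In this toy run, alignment drops the Einstein triple because neither entity appears in the text, augmentation contributes the `work_location` triple, and the NLI stub removes the `place_of_death` triple that matched by string overlap alone but is not supported by the sentence.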
Abstract
Large language models perform well on general natural language tasks, but their effectiveness on information extraction (IE) remains suboptimal. Recent work indicates that the main reason is the lack of extensive IE instruction data. Existing IE instruction datasets not only have limited coverage but are also costly to construct. To address this issue, we introduce InstructIE, a bilingual instruction-based IE dataset that covers 12 diverse domains. We propose KG2Instruction, a framework designed specifically for the automatic generation of such datasets. Additionally, we manually annotate the test set. Experimental results demonstrate that large language models trained with InstructIE not only acquire stronger IE capabilities but also achieve better zero-shot performance than baselines.
