ADELIE: Aligning Large Language Models on Information Extraction

Yunjia Qi; Hao Peng; Xiaozhi Wang; Bin Xu; Lei Hou; Juanzi Li

TL;DR

ADELIE (Aligning large language moDELs on Information Extraction) is an aligned LLM that handles closed IE, open IE, and on-demand IE tasks and achieves state-of-the-art (SoTA) performance among open-source models.

Abstract

Large language models (LLMs) usually fall short on information extraction (IE) tasks and struggle to follow the complex instructions these tasks involve. This primarily arises because LLMs are not aligned with humans on IE: mainstream alignment datasets typically do not include IE data. In this paper, we introduce ADELIE (Aligning large language moDELs on Information Extraction), an aligned LLM that effectively solves various IE tasks, including closed IE, open IE, and on-demand IE. We first collect and construct IEInstruct, a high-quality alignment corpus for IE. We then train ADELIE_SFT by instruction tuning on IEInstruct, and further train ADELIE_SFT with a direct preference optimization (DPO) objective, resulting in ADELIE_DPO. Extensive experiments on various held-out IE datasets demonstrate that our models (ADELIE_SFT and ADELIE_DPO) achieve state-of-the-art (SoTA) performance among open-source models. We further evaluate the general capabilities of the ADELIE models, and the results show no noticeable decline. We will release the code, data, and models to facilitate further research.
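
For reference, the DPO objective mentioned in the abstract is usually written in the following standard form. The abstract does not give the exact setup, so this is a sketch under assumptions: $\pi_{\mathrm{ref}}$ is taken to be the frozen ADELIE_SFT policy, $(x, y_w, y_l)$ is an input with a preferred and a dispreferred output, $\beta$ is a temperature hyperparameter, and $\sigma$ is the logistic function.

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]
```

Minimizing this loss raises the likelihood of the preferred output relative to the dispreferred one, measured against the reference policy, without training a separate reward model.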

Paper Structure (44 sections, 6 figures, 11 tables)

Introduction
Related Work
Information Extraction Tasks
LLMs for Information Extraction
Alignment Data Construction
IE Data Collection
Input Construction
Task Description
Schema Description
Output Format Description
Few-shot Demonstrations
Answer Construction
Model Training
Experiments
Experimental Setup
...and 29 more sections

Figures (6)

Figure 1: F1 scores (%) on closed, open, and on-demand IE tasks in the few-shot setting. SoTA* denotes the best performance of open-source models.
Figure 2: IE tasks, datasets, and respective proportions in IEInstruct.
Figure 3: An example of the input and output in IEInstruct. $50\%$ of the data in IEInstruct includes in-context demonstrations. The instruction consists of the descriptions of task, schema, and output format. The output consists of an explanation (for $10\%$ of the instances in IEInstruct) and the answer adhering to the format in instruction.
Figure 4: Scores (%) on IE tasks (average of closed IE, open IE, and on-demand IE) and general tasks (average of commonsense reasoning, MMLU, and BBH) of our model trained with varying proportions of IE data. We finally adopt a proportion of $20\%$ to train ADELIE_SFT.
Figure 5: Performance improvements (%) of the model trained on varying scales of data, compared to ADELIE_SFT before DPO training.
...and 1 more figures
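
The instance layout described in the Figure 3 caption can be sketched in Python as follows. This is an illustrative sketch, not the authors' construction code: the function name `build_instance`, the `Text:`/`Answer:` prompt layout, and the choice to place the explanation before the answer are all assumptions. Only the two sampling rates and the three instruction components come from the caption.

```python
import random

# Rates taken from the Figure 3 caption.
DEMO_RATE = 0.5         # share of instances with in-context demonstrations
EXPLANATION_RATE = 0.1  # share of instances whose output carries an explanation


def build_instance(task, schema, output_format, text, answer,
                   demos=(), explanation="", rng=random):
    """Assemble one hypothetical (input, output) pair in the Figure 3 layout."""
    # The instruction joins the task, schema, and output format descriptions.
    instruction = "\n".join([task, schema, output_format])
    parts = [instruction]
    # Attach few-shot demonstrations to roughly half of the instances.
    if demos and rng.random() < DEMO_RATE:
        parts.extend(f"Text: {d_text}\nAnswer: {d_answer}"
                     for d_text, d_answer in demos)
    parts.append(f"Text: {text}")
    # Add an explanation to roughly one in ten outputs
    # (placing it before the answer is an assumption).
    if explanation and rng.random() < EXPLANATION_RATE:
        output = f"{explanation}\n{answer}"
    else:
        output = answer
    return {"input": "\n\n".join(parts), "output": output}
```

Passing a seeded `random.Random(seed)` as `rng` makes the sampling reproducible, which helps when regenerating a corpus with the same demonstration and explanation assignments.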
