An Empirical Study on Information Extraction using Large Language Models

Ridong Han, Chaohao Yang, Tao Peng, Prayag Tiwari, Xiang Wan, Lu Liu, Benyou Wang

TL;DR

This study provides an empirical evaluation of GPT-4 on information extraction across 14 subtasks and 16 datasets, revealing a persistent gap relative to state-of-the-art supervised IE methods. It shows that GPT-4 benefits from few-shot prompts while chain-of-thought prompts offer inconsistent gains, and it proposes a soft-matching evaluation to better reflect GPT-4’s human-like span generation. The authors introduce three prompt-based improvement methods (Task-related Knowledge Informing, Methodology Specifying, and Sufficient Extraction Reminder) and demonstrate that targeted prompting can close part of the performance gap, especially on harder tasks. They also analyze robustness, error types, and long-tail effects, highlighting limitations in subject–object ordering and annotation quality, and discuss practical implications for using LLMs in IE applications and data annotation.

Abstract

Human-like large language models (LLMs), especially the most powerful and popular ones in OpenAI's GPT family, have proven to be very helpful for many natural language processing (NLP) related tasks. Therefore, various attempts have been made to apply LLMs to information extraction (IE), which is a fundamental NLP task that involves extracting information from unstructured plain text. To demonstrate the latest representative progress in LLMs' information extraction ability, we assess the information extraction ability of GPT-4 (the latest version of GPT at the time of writing this paper) from four perspectives: Performance, Evaluation Criteria, Robustness, and Error Types. Our results suggest a visible performance gap between GPT-4 and state-of-the-art (SOTA) IE methods. To alleviate this problem, considering the LLMs' human-like characteristics, we propose and analyze the effects of a series of simple prompt-based methods, which can be generalized to other LLMs and NLP tasks. Rich experiments show our methods' effectiveness and some of their remaining issues in improving GPT-4's information extraction ability.

Paper Structure (37 sections, 2 figures, 10 tables, 1 algorithm)

Figures (2)

  • Figure 1: An example of prompts for the NER-Flat sub-task on the CoNLL03 dataset. See the example-prompts section (sec:example_prompt) for more prompts.
  • Figure 2: Percentage of error types for the ABSA-AESC, NER-Flat, RE-Triplet, and EE-Trigger sub-tasks on the $D_{20a}$-14lap, CoNLL03, CoNLL04, and ACE05-Evt datasets, respectively.