Revisiting Relation Extraction in the era of Large Language Models

Somin Wadhwa; Silvio Amir; Byron C. Wallace

TL;DR

This work reframes relation extraction as conditional text generation using large language models. It demonstrates that few-shot prompting with GPT-3 delivers near-state-of-the-art results on standard RE datasets, and that the open-source Flan-T5 can reach state-of-the-art performance when trained with chain-of-thought explanations elicited from GPT-3. It further presents a rigorous evaluation framework that uses human judgments to remedy the pitfalls of exact-match metrics for generative RE. The findings indicate that LLM-based RE can serve as a strong baseline, with CoT-guided supervision offering significant gains, and highlight practical considerations around evaluation, cost, and generalization across datasets.

Abstract

Relation extraction (RE) is the core NLP task of inferring semantic relationships between entities from text. Standard supervised RE techniques entail training modules to tag tokens comprising entity spans and then predict the relationship between them. Recent work has instead treated the problem as a sequence-to-sequence task, linearizing relations between entities as target strings to be generated conditioned on the input. Here we push the limits of this approach, using larger language models (GPT-3 and Flan-T5 large) than considered in prior work and evaluating their performance on standard RE tasks under varying levels of supervision. We address issues inherent to evaluating generative approaches to RE by doing human evaluations, in lieu of relying on exact matching. Under this refined evaluation, we find that: (1) Few-shot prompting with GPT-3 achieves near-SOTA performance, i.e., roughly equivalent to existing fully supervised models; (2) Flan-T5 is not as capable in the few-shot setting, but supervising and fine-tuning it with Chain-of-Thought (CoT) style explanations (generated via GPT-3) yields SOTA results. We release this model as a new baseline for RE tasks.
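To make the sequence-to-sequence framing concrete, the following is a minimal sketch of how relation triples might be linearized into a target string, parsed back from generated text, and scored by exact matching. The `(subject | relation | object)` template, the function names, and the example sentence are assumptions made for illustration only; they are not the linearization or prompts used in the paper.

```python
from typing import List, Tuple

# Hypothetical triple layout: (subject, relation, object).
Triple = Tuple[str, str, str]


def linearize(triples: List[Triple]) -> str:
    """Turn triples into one target string (illustrative template, not the paper's)."""
    return " ; ".join(f"({s} | {r} | {o})" for s, r, o in triples)


def parse(text: str) -> List[Triple]:
    """Recover triples from generated text; malformed chunks are skipped.

    Assumes entities and relations contain neither ';' nor '|'.
    """
    triples = []
    for chunk in text.split(";"):
        parts = [p.strip() for p in chunk.strip().strip("()").split("|")]
        if len(parts) == 3 and all(parts):
            triples.append((parts[0], parts[1], parts[2]))
    return triples


def exact_match_f1(pred: List[Triple], gold: List[Triple]) -> float:
    """F1 over triples, counting a prediction as correct only if it matches exactly."""
    pred_set, gold_set = set(pred), set(gold)
    tp = len(pred_set & gold_set)
    if tp == 0:
        return 0.0
    precision = tp / len(pred_set)
    recall = tp / len(gold_set)
    return 2 * precision * recall / (precision + recall)


# Made-up example: the generated object is a valid abbreviation of the gold one.
gold = [("Boston", "located_in", "Massachusetts")]
target = linearize(gold)
generated = "(Boston | located_in | MA)"
score = exact_match_f1(parse(generated), gold)
```

In this sketch the generated triple is arguably correct, yet exact matching counts it as both a false positive and a false negative, which is the kind of evaluation issue the human re-evaluation described above is meant to address.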

Paper Structure (34 sections, 1 equation, 4 figures, 8 tables)

Introduction
RE via Text Generation
Challenges inherent to evaluating generative large language models for RE
In-Context Few-Shot Learning with GPT-3 for RE
Prompts
ADE
CoNLL
NYT
Manually re-evaluating "errors"
Results
CoNLL
NYT
SOTA RE Performance with Flan-T5
Few-Shot RE with Flan-T5
Fine-tuning Flan-T5 for RE
...and 19 more sections

Figures (4)

Figure 1: RE performance of LLMs on the CoNLL dataset. (1) Few-shot GPT-3 slightly outperforms the existing fully supervised SOTA method (Huguet Cabot and Navigli, 2021; dotted horizontal line). (2) Eliciting CoT reasoning from GPT-3 further improves few-shot performance. (3) Fine-tuning Flan-T5 (large) is competitive with, but no better than, existing supervised methods, but (4) supervising Flan-T5 with CoT reasoning elicited from GPT-3 substantially outperforms all other models.
Figure 2: Examples of misclassified FPs and FNs from GPT-3 (generated under a few-shot in-context prompting scheme) under traditional evaluation of generative output. In each instance, the entity type of the subject and object was correctly identified.
Figure 3: We propose fine-tuning Flan-T5 (large) for relation extraction (RE) using standard supervision and Chain-of-Thought (CoT) reasoning elicited from GPT-3 for RE. This yields SOTA performance across all datasets considered, often by a substantial margin (~5 points absolute gain in F1).
Figure 4: AUC plots for FPs and FNs.
