Large Language Models for Few-Shot Named Entity Recognition
Yufei Zhao, Xiaoshi Zhong, Erik Cambria, Jagath C. Rajapakse
TL;DR
This paper introduces GPT4NER, a prompting-based framework that performs few-shot named entity recognition by recasting the task as sequence generation with LLMs. Its prompts combine three core components (entity definitions, carefully selected few-shot examples with an explicit output format, and chain-of-thought reasoning), with an optional POS cue to guide predictions. On CoNLL2003 and OntoNotes5.0, GPT4NER outperforms representative few-shot baselines and reaches 87.9% and 76.4%, respectively, of the best fully supervised performance, with notable gains under both strict and relaxed evaluation. The work also highlights the value of relaxed-match evaluation and of reporting the NEE sub-task for understanding model behavior and limitations in real-world NER tasks.
Abstract
Named entity recognition (NER) is a fundamental task in numerous downstream applications. Recently, researchers have employed pre-trained language models (PLMs) and large language models (LLMs) to address this task. However, fully leveraging the capabilities of PLMs and LLMs with minimal human effort remains challenging. In this paper, we propose GPT4NER, a method that prompts LLMs to solve the few-shot NER task. GPT4NER constructs effective prompts from three key components: entity definitions, few-shot examples, and chain-of-thought. With these prompts, GPT4NER transforms few-shot NER, which is traditionally treated as a sequence-labeling problem, into a sequence-generation problem. We conduct experiments on two benchmark datasets, CoNLL2003 and OntoNotes5.0, and compare the performance of GPT4NER to representative state-of-the-art models in both few-shot and fully supervised settings. Experimental results demonstrate that GPT4NER achieves an $F_1$ score of 83.15\% on CoNLL2003 and 70.37\% on OntoNotes5.0, significantly outperforming few-shot baselines by an average margin of 7 points. Compared to fully supervised baselines, GPT4NER achieves 87.9\% of their best performance on CoNLL2003 and 76.4\% of their best performance on OntoNotes5.0. We also evaluate with a relaxed-match metric and report performance on the sub-task of named entity extraction (NEE); experiments show that both help to better understand model behavior in the NER task.
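To make the three prompt components concrete, the following is a minimal sketch of how such a prompt might be assembled. It is not the authors' implementation: the names `Example`, `format_entities`, and `build_prompt`, the wording of the definitions and reasoning, and the `span -> TYPE` output format are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Example:
    """One few-shot demonstration (hypothetical structure)."""
    sentence: str
    reasoning: str                    # chain-of-thought text for this sentence
    entities: List[Tuple[str, str]]   # (entity span, entity type) pairs


def format_entities(entities: List[Tuple[str, str]]) -> str:
    """Render entities in an explicit output format (assumed: 'span -> TYPE')."""
    if not entities:
        return "None"
    return "; ".join(f"{span} -> {label}" for span, label in entities)


def build_prompt(definitions: Dict[str, str],
                 examples: List[Example],
                 sentence: str) -> str:
    """Concatenate entity definitions, few-shot examples, and the query."""
    lines = ["Entity definitions:"]
    for label, description in definitions.items():
        lines.append(f"- {label}: {description}")
    lines.append("")
    for ex in examples:
        lines.append(f"Sentence: {ex.sentence}")
        lines.append(f"Reasoning: {ex.reasoning}")
        lines.append(f"Entities: {format_entities(ex.entities)}")
        lines.append("")
    # The query ends with an open "Reasoning:" slot so the model is
    # nudged to reason first and then list the entities.
    lines.append(f"Sentence: {sentence}")
    lines.append("Reasoning:")
    return "\n".join(lines)


# Illustrative inputs only; the definitions and reasoning text are made up.
definitions = {
    "ORG": "companies, institutions, and other organizations",
    "MISC": "nationalities, events, and other named entities",
}
examples = [
    Example(
        sentence="EU rejects German call to boycott British lamb.",
        reasoning="EU is an organization; German and British are nationalities.",
        entities=[("EU", "ORG"), ("German", "MISC"), ("British", "MISC")],
    )
]
prompt = build_prompt(definitions, examples, "Japan beat Syria in the Asian Cup.")
print(prompt)
```

The generated string would then be sent to an LLM, and the entities parsed from its text output, which is what turns sequence labeling into sequence generation. How examples are selected and how the output is parsed are left out here because the abstract does not specify them.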
