Learning to Rank Context for Named Entity Recognition Using a Synthetic Dataset
Arthur Amalvy, Vincent Labatut, Richard Dufour
TL;DR
Transformer-based NER models have a limited input range, which hurts them on long documents. The paper addresses this with document-level context retrieval, trained on a synthetic context-retrieval dataset generated with an instruction-tuned LLM. A BERT-based neural retriever re-ranks a broad pool of candidate sentences, and the top-$k$ contexts are concatenated to the input sentence for NER. The approach is evaluated on an English literary NER dataset (the first chapters of 40 novels), where it improves over unsupervised baselines and is competitive with supervised re-rankers. The best reported configuration uses $n=8$ and $k=3$, with a synthetic evaluation F1 of $98.01$. The work shows that synthetic supervision from instruction-following LLMs can enable effective document-level retrieval for NER and suggests potential generalization to other tasks that require global document context.
Abstract
While recent pre-trained transformer-based models can perform named entity recognition (NER) with great accuracy, their limited range remains an issue when applied to long documents such as whole novels. To alleviate this issue, a solution is to retrieve relevant context at the document level. Unfortunately, the lack of supervision for such a task means one has to settle for unsupervised approaches. Instead, we propose to generate a synthetic context retrieval training dataset using Alpaca, an instruction-tuned large language model (LLM). Using this dataset, we train a neural context retriever based on a BERT model that is able to find relevant context for NER. We show that our method outperforms several retrieval baselines for the NER task on an English literary dataset composed of the first chapter of 40 books.
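
The retrieve-then-concatenate idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`overlap_score`, `rerank_contexts`, `build_ner_input`) are hypothetical, the token-overlap scorer is only a placeholder standing in for the trained BERT-based retriever, and prepending the retrieved sentences to the input is an assumption made for illustration.

```python
from typing import Callable, List, Sequence


def overlap_score(sentence: str, candidate: str) -> float:
    """Placeholder relevance score (NOT the paper's BERT retriever):
    fraction of the input sentence's tokens that also occur in the candidate."""
    sentence_tokens = set(sentence.lower().split())
    candidate_tokens = set(candidate.lower().split())
    if not sentence_tokens:
        return 0.0
    return len(sentence_tokens & candidate_tokens) / len(sentence_tokens)


def rerank_contexts(
    sentence: str,
    candidates: Sequence[str],
    k: int,
    score_fn: Callable[[str, str], float] = overlap_score,
) -> List[str]:
    """Score every candidate sentence against the input sentence and
    keep the k highest-scoring ones (ties keep their original order)."""
    ranked = sorted(candidates, key=lambda c: score_fn(sentence, c), reverse=True)
    return list(ranked[:k])


def build_ner_input(sentence: str, contexts: Sequence[str], sep: str = " ") -> str:
    """Concatenate the retrieved contexts with the input sentence.
    Placing the contexts before the sentence is an assumption of this sketch."""
    return sep.join([*contexts, sentence])


if __name__ == "__main__":
    sentence = "Gandalf smiled"
    candidates = [
        "Gandalf is a wizard",
        "The weather was fine",
        "Frodo saw Gandalf smiled",
    ]
    contexts = rerank_contexts(sentence, candidates, k=2)
    print(build_ner_input(sentence, contexts))
```

In a setup closer to the one described, `score_fn` would be replaced by a trained relevance classifier, and the NER model would predict labels only for the tokens of the original sentence rather than for the added context.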
