CoNLL#: Fine-grained Error Analysis and a Corrected Test Set for CoNLL-03 English

Andrew Rueda; Elena Álvarez Mellado; Constantine Lignos

TL;DR

This work investigates why state-of-the-art NER models plateau on CoNLL-03 English. It performs a fine-grained error analysis using new document-level annotations, then introduces CoNLL#, a corrected test set that adjudicates past corrections and fixes additional systematic issues. Three SOTA models are evaluated on both the original and corrected test sets, showing F1 gains of over 2 points and persistent error patterns, especially in economy-domain documents. Many errors originate from annotation, boundary, and tokenization issues in the test data, which motivates a low-noise evaluation framework for diagnosing the remaining NER challenges. The corrected dataset and its analysis offer a practical path to more reliable benchmarking and can guide future improvements across languages and datasets by reducing data noise as a confounding factor in reported progress.

Abstract

Modern named entity recognition systems have steadily improved performance in the age of larger and more powerful neural models. However, over the past several years, the state-of-the-art has seemingly hit another plateau on the benchmark CoNLL-03 English dataset. In this paper, we perform a deep dive into the test outputs of the highest-performing NER models, conducting a fine-grained evaluation of their performance by introducing new document-level annotations on the test set. We go beyond F1 scores by categorizing errors in order to interpret the true state of the art for NER and guide future work. We review previous attempts at correcting the various flaws of the test set and introduce CoNLL#, a new corrected version of the test set that addresses its systematic and most prevalent errors, allowing for low-noise, interpretable error analysis.
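To make the comparison concrete, the following is a minimal sketch of exact-match, entity-level F1 over BIO tags, scored once against an original gold annotation and once against a corrected one. It is not the authors' evaluation code: the function names `bio_to_spans` and `entity_f1` and the toy tag sequences are assumptions made up for illustration, and the numbers it produces are not results from the paper.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, entity_type)


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Convert one sentence of BIO tags into a set of typed spans."""
    spans: Set[Span] = set()
    start, etype = None, None
    # A trailing "O" sentinel closes any span that runs to the end.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, etype))
            start, etype = None, None
        # A stray I- tag with no open span is leniently treated as a start.
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, etype = i, tag[2:]
    return spans


def entity_f1(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Micro-averaged exact-match F1 over all sentences."""
    tp = fp = fn = 0
    for gold_tags, pred_tags in zip(gold, pred):
        gold_spans = bio_to_spans(gold_tags)
        pred_spans = bio_to_spans(pred_tags)
        tp += len(gold_spans & pred_spans)
        fp += len(pred_spans - gold_spans)
        fn += len(gold_spans - pred_spans)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Invented toy data: the "original" gold has a boundary error that the
# "corrected" gold fixes; the model prediction was right all along.
pred = [["B-ORG", "I-ORG", "O", "B-LOC"]]
original_gold = [["B-ORG", "O", "O", "B-LOC"]]
corrected_gold = [["B-ORG", "I-ORG", "O", "B-LOC"]]

f1_original = entity_f1(original_gold, pred)
f1_corrected = entity_f1(corrected_gold, pred)
print(f"original: {f1_original:.2f}, corrected: {f1_corrected:.2f}")
```

In this toy case the model is penalized with both a false positive and a false negative under the flawed gold annotation, which is the kind of noise a corrected test set is meant to remove from the score.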
