NoiseBench: Benchmarking the Impact of Real Label Noise on Named Entity Recognition

Elena Merdjanovska, Ansar Aynetdinov, Alan Akbik

TL;DR

This paper introduces NoiseBench, an NER benchmark consisting of clean training data corrupted with 6 types of real noise, including expert errors, crowdsourcing errors, automatic annotation errors and LLM errors, and shows that current state-of-the-art models for noise-robust learning fall far short of their achievable upper bound.

Abstract

Available training data for named entity recognition (NER) often contains a significant percentage of incorrect labels for entity types and entity boundaries. Such label noise poses challenges for supervised learning and may significantly deteriorate model quality. To address this, prior work proposed various noise-robust learning approaches capable of learning from data with partially incorrect labels. These approaches are typically evaluated using simulated noise where the labels in a clean dataset are automatically corrupted. However, as we show in this paper, this leads to unrealistic noise that is far easier to handle than real noise caused by human error or semi-automatic annotation. To enable the study of the impact of various types of real noise, we introduce NoiseBench, an NER benchmark consisting of clean training data corrupted with 6 types of real noise, including expert errors, crowdsourcing errors, automatic annotation errors and LLM errors. We present an analysis that shows that real noise is significantly more challenging than simulated noise, and show that current state-of-the-art models for noise-robust learning fall far short of their theoretically achievable upper bound. We release NoiseBench to the research community.

Paper Structure (55 sections, 4 figures, 12 tables)

This paper contains 55 sections, 4 figures, 12 tables.

Figures (4)

  • Figure 1: Examples of text snippets with correct labels (top row) and two types of noise: real noise from crowdsourcing (middle row) and simulated class-dependent noise (bottom row). The noise introduces different types of errors: (a) partial matches of correct entity mentions, (b) a wrong type and a non-entity mention, and (c) a missing entity. We qualitatively find real noise to be more plausible than simulated noise.
  • Figure 2: F1 scores on different subsets of entities in the test set: all, seen (clean), seen (noisy) and unseen.
  • Figure 4: Comparison of model performance during extended training, for the German dataset.
  • Figure 6: Memorization of label noise in DistilBERT, for the pretrained model and for a model with randomly initialized weights. The experiment was run for one noise type, Crowd++.