AlleNoise: large-scale text classification benchmark dataset with real-world label noise

Alicja Rączkowska; Aleksandra Osowska-Kurczab; Jacek Szczerbiński; Kalina Jasinska-Kobus; Klaudia Nazarko

TL;DR

AlleNoise is a new curated text classification benchmark dataset with real-world instance-dependent label noise. It contains over 500,000 examples across approximately 5,600 classes and comes with a meaningful, hierarchical taxonomy of categories.

Abstract

Label noise remains a challenge for training robust classification models. Most methods for mitigating label noise have been benchmarked primarily on datasets with synthetic noise. While the need for datasets with a realistic noise distribution has partially been addressed by web-scraped benchmarks such as WebVision and Clothing1M, those benchmarks are restricted to the computer vision domain. With the growing importance of Transformer-based models, it is crucial to establish text classification benchmarks for learning with noisy labels. In this paper, we present AlleNoise, a new curated text classification benchmark dataset with real-world instance-dependent label noise, containing over 500,000 examples across approximately 5,600 classes, complemented with a meaningful, hierarchical taxonomy of categories. The noise distribution comes from actual users of a major e-commerce marketplace, so it realistically reflects the semantics of human mistakes. In addition to the noisy labels, we provide human-verified clean labels, which, unlike the web-scraped datasets typically used in the field, allow deeper insight into the noise distribution. We demonstrate that a representative selection of established methods for learning with noisy labels is inadequate to handle such real-world noise. In addition, we show evidence that these algorithms do not alleviate excessive memorization. As such, with AlleNoise, we set the bar high for the development of label noise methods that can handle real-world label noise in text classification tasks. The code and dataset are available for download at https://github.com/allegro/AlleNoise.
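
To make the contrast between synthetic and real-world noise concrete, the sketch below shows two common ways of injecting synthetic label noise: symmetric noise and pair-flip noise. It is a minimal illustration under assumptions, not the paper's generation procedure: the function names, the "flip to the next class index" convention for pair-flip noise, and the seeding scheme are all hypothetical choices made here.

```python
import random


def symmetric_noise(labels, num_classes, noise_level, seed=0):
    """With probability noise_level, replace a label with a different class
    drawn uniformly at random (assumed convention, not taken from the paper)."""
    rng = random.Random(seed)
    noisy = []
    for y in labels:
        if rng.random() < noise_level:
            # Sample from the other num_classes - 1 classes so the label always changes.
            other = rng.randrange(num_classes - 1)
            noisy.append(other if other < y else other + 1)
        else:
            noisy.append(y)
    return noisy


def pair_flip_noise(labels, num_classes, noise_level, seed=0):
    """With probability noise_level, flip a label to one fixed partner class.
    The partner is assumed here to be the next class index (wrapping around)."""
    rng = random.Random(seed)
    return [
        (y + 1) % num_classes if rng.random() < noise_level else y
        for y in labels
    ]


clean = [0, 1, 2, 3, 4] * 20
noisy_sym = symmetric_noise(clean, num_classes=5, noise_level=0.4)
noisy_pair = pair_flip_noise(clean, num_classes=5, noise_level=0.4)
```

Both schemes ignore the content of the example, which is why an error such as a toothbrush labeled as a tire is easy to spot. Instance-dependent noise of the kind collected in AlleNoise depends on the item itself and cannot be reproduced by either function.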

Paper Structure (30 sections, 13 figures, 9 tables)

Introduction
Related work
AlleNoise Dataset Construction
Real-world noise
Clean data sampling
Post-processing
Methods
Problem statement
Synthetic noise generation
Model architecture
Evaluation metrics
Benchmarked methods
Results
Synthetic noise vs AlleNoise
Noise type impacts memorization
...and 15 more sections

Figures (13)

Figure 1: Symmetric noise vs. AlleNoise in examples. Correct and noisy labels are marked in green and red, respectively. (a) Symmetric noise: an electric toothbrush incorrectly labeled as a winter tire is easy to spot, even for an untrained human. (b) AlleNoise: a ceiling dome is mislabeled as a pendant lamp. This error is semantically challenging and hard to detect. Note: the AlleNoise dataset does not include images.
Figure 2: AlleNoise consists of two tables: the first table includes the true and noisy label for each product title, while the second table maps the labels to category names.
Figure 3: Memorization and correctness metrics as a function of the training step. (a) The value of $\texttt{memorized}_{val}$ for synthetic noise types. (b) The value of $\texttt{memorized}_{val}$ for AlleNoise. (c) The value of $\texttt{correct}_{val}^{\texttt{clean}}$ for AlleNoise. (d) The value of $\texttt{correct}_{val}^{\texttt{noisy}}$ for AlleNoise.
Figure 4: Noise distribution and patterns of wrong predictions across different noise types. (a) Noise level distribution over target categories (blue bars) shows that AlleNoise has a substantial fraction of classes with noise level over 0.5, contrary to synthetic noise. The same distribution multiplied by per-bin macro accuracy (yellow bars) shows that those specialized categories are particularly difficult to predict correctly. (b) Scatter plot of true noise level versus observed noise level in each category for pair-flip noise. Marker color represents accuracy, and marker size reflects category size. True noise levels are concentrated around 15%, with no distinct specialized or archetypal categories observed. The plot includes only categories with at least 10 products. (c) Scatter plot for real-world AlleNoise, highlighting the presence of many specialized and archetypal categories. Accuracy in specialized categories is negatively correlated with the true noise level. A significant number of categories exhibit both high true noise and high observed noise levels. Scatter plots for other noise types are presented in the appendix.
Figure S1: Value of $\texttt{memorized}_{val}$ for different noise types, measured at each training step. In all cases the noise level was set at 40%.
...and 8 more figures
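
The two-table layout described in the Figure 2 caption, together with the per-category noise levels discussed in the Figure 4 caption, can be sketched as follows. This is a toy illustration under assumptions: the rows, the field names (`title`, `clean_label`, `noisy_label`), the category ids and names, and the helper `noise_level_by_category` are all hypothetical and are not taken from the released files. The per-category noise level is assumed here to be the fraction of examples in a clean category whose noisy label differs from the clean one.

```python
from collections import defaultdict

# Hypothetical stand-in for the first table: one row per product title,
# with its human-verified (clean) label and its user-assigned (noisy) label.
rows = [
    {"title": "Ceiling dome lamp white 30 cm", "clean_label": 10, "noisy_label": 11},
    {"title": "Ceiling dome LED 18W", "clean_label": 10, "noisy_label": 10},
    {"title": "Pendant lamp black loft", "clean_label": 11, "noisy_label": 11},
    {"title": "Electric toothbrush sonic", "clean_label": 20, "noisy_label": 20},
]

# Hypothetical stand-in for the second table: label id -> category name.
category_names = {10: "Ceiling domes", 11: "Pendant lamps", 20: "Electric toothbrushes"}


def noise_level_by_category(rows):
    """Fraction of mislabeled examples, grouped by clean label (assumed definition)."""
    totals = defaultdict(int)
    wrong = defaultdict(int)
    for row in rows:
        totals[row["clean_label"]] += 1
        if row["noisy_label"] != row["clean_label"]:
            wrong[row["clean_label"]] += 1
    return {label: wrong[label] / totals[label] for label in totals}


per_category = noise_level_by_category(rows)
overall = sum(r["noisy_label"] != r["clean_label"] for r in rows) / len(rows)

# Join with the category table to report noise levels by name.
report = {category_names[label]: level for label, level in per_category.items()}
```

Grouping by the clean label is only possible because both labels are available for every example; with a web-scraped dataset that lacks verified labels, this breakdown could not be computed directly.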
