AQuA: A Benchmarking Tool for Label Quality Assessment

Mononito Goswami; Vedant Sanil; Arjun Choudhry; Arvind Srinivasan; Chalisa Udompanyawit; Artur Dubrawski

TL;DR

AQuA introduces a comprehensive benchmarking framework for evaluating label noise detection and cleaning methods across multiple data modalities and annotation regimes. It defines a design space for label error detection models, offers seven noise-injection strategies, and evaluates four state-of-the-art detectors with diverse downstream classifiers and metrics beyond accuracy. Large-scale experiments reveal that SimiFeat and CINCER excel at identifying mislabeled data, while deep models often remain robust to label noise, and results depend on modality and evaluation metric. The benchmark aims to enable objective, reproducible comparisons and to guide practitioners in selecting appropriate label-cleaning tools for real-world data, with ongoing plans to expand to multi-annotator and fairness-focused dimensions.

Abstract

Machine learning (ML) models are only as good as the data they are trained on. But recent studies have found datasets widely used to train and evaluate ML models, e.g. ImageNet, to have pervasive labeling errors. Erroneous labels on the train set hurt ML models' ability to generalize, and they impact evaluation and model selection using the test set. Consequently, learning in the presence of labeling errors is an active area of research, yet this field lacks a comprehensive benchmark to evaluate these methods. Most of these methods are evaluated on a few computer vision datasets with significant variance in the experimental protocols. With such a large pool of methods and inconsistent evaluation, it is also unclear how ML practitioners can choose the right models to assess label quality in their data. To this end, we propose a benchmarking environment AQuA to rigorously evaluate methods that enable machine learning in the presence of label noise. We also introduce a design space to delineate concrete design choices of label error detection models. We hope that our proposed design space and benchmark enable practitioners to choose the right tools to improve their label quality and that our benchmark enables objective and rigorous evaluation of machine learning tools facing mislabeled data.

Paper Structure (70 sections, 1 equation, 10 figures, 27 tables)

Introduction
Background and Problem Formulation
A Design Space of Labeling Error Detection Models
Benchmark Design
Real-world, Popular Datasets, and Downstream Classification Models
Advanced Label Error Detection Methods
Evaluation
Experiments, Results and Discussion
Insights from Large-scale Experiments using AQuA
Conclusion and Future Work
Limitations, Biases, and Social Impacts
Appendix
A Design Space of Labeling Error Detection Models
Noise Transition Matrix.
Estimating $\mathbf{T}$ using Anchor Points.
...and 55 more sections

Figures (10)

Figure 1: Overview of the AQuA benchmark framework. AQuA comprises datasets from 4 modalities, 4 single-label and 3 multi-annotator label noise injection methods, 4 state-of-the-art label error detection models, classification models, and several evaluation metrics beyond predictive accuracy. We are in the process of integrating several fairness, generalization, and robustness metrics into AQuA. The red and blue arrows show two example experimental pipelines for image data and time-series data, respectively.
Figure 2: Labeling errors in widely used benchmarks: CIFAR-10, Clothing-100K, MIT-BIH, and TweetEval Hate Speech datasets. Observed labels are in red and true labels are in green.
Figure 3: Design space of labeling error detection models to delineate concrete design choices.
Figure 4: AQuA makes it simple to identify label issues and to evaluate new and existing label error detection models.
Figure 5: Critical difference diagrams representing rankings of cleaning methods across (i) all datasets, (iii) only image datasets, or (iv) only text datasets. Panel (v) shows the ranking of cleaning methods across all datasets when accuracy is measured instead of weighted $F_1$ (cf. i). Finally, (ii) represents the performance of downstream models trained using cleaned labels, and (vi) the performance of all cleaning methods disaggregated by noise type.
...and 5 more figures
