Intrinsic Self-Supervision for Data Quality Audits

Fabian Gröger; Simone Lionetti; Philippe Gottfrois; Alvaro Gonzalez-Jimenez; Ludovic Amruthalingam; Labelling Consortium; Matthew Groh; Alexander A. Navarini; Marc Pouly

TL;DR

A specific combination of context-aware self-supervised representation learning and distance-based indicators is effective at finding data quality issues in widely used image datasets without annotation biases, for both synthetic issues and real contamination.

Abstract

Benchmark datasets in computer vision often contain off-topic images, near duplicates, and label errors, leading to inaccurate estimates of model performance. In this paper, we revisit the task of data cleaning and formalize it as either a ranking problem, which significantly reduces human inspection effort, or a scoring problem, which allows for automated decisions based on score distributions. We find that a specific combination of context-aware self-supervised representation learning and distance-based indicators is effective in finding issues without annotation biases. This methodology, which we call SelfClean, surpasses state-of-the-art performance in detecting off-topic images, near duplicates, and label errors within widely-used image datasets, such as ImageNet-1k, Food-101N, and STL-10, both for synthetic issues and real contamination. We apply the detailed method to multiple image benchmarks, identify up to 16% of issues, and confirm an improvement in evaluation reliability upon cleaning. The official implementation can be found at: https://github.com/Digital-Dermatology/SelfClean.
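To make the ranking formulation concrete, the following is a minimal sketch of two distance-based indicators computed on embeddings that are assumed to come from an already trained encoder. It is not the official SelfClean implementation: the function names, the toy data, the choice of cosine distance on L2-normalized vectors, and the exact form of the intra-/extra-class ratio are illustrative assumptions. The sketch also assumes that every class has at least two samples and that at least two classes are present.

```python
import math
from itertools import combinations


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_distance(u, v):
    """Cosine distance between two unit-length vectors."""
    return 1.0 - sum(a * b for a, b in zip(u, v))


def rank_near_duplicates(embeddings):
    """Return (distance, i, j) for all pairs, closest pairs first."""
    unit = [l2_normalize(e) for e in embeddings]
    pairs = [
        (cosine_distance(unit[i], unit[j]), i, j)
        for i, j in combinations(range(len(unit)), 2)
    ]
    return sorted(pairs)


def label_error_scores(embeddings, labels, eps=1e-12):
    """Score each sample by nearest same-class vs. nearest other-class distance.

    Hypothetical ratio: intra / (intra + extra). Values close to 1 mean the
    sample sits nearer to another class than to its own, so its label is
    more suspicious.
    """
    unit = [l2_normalize(e) for e in embeddings]
    indices = range(len(unit))
    scores = []
    for i in indices:
        intra = min(
            cosine_distance(unit[i], unit[j])
            for j in indices
            if j != i and labels[j] == labels[i]
        )
        extra = min(
            cosine_distance(unit[i], unit[j])
            for j in indices
            if labels[j] != labels[i]
        )
        scores.append(intra / (intra + extra + eps))
    return scores


# Toy 2-D embeddings: samples 0 and 1 are near duplicates, and sample 5
# carries label "a" although it lies among the "b" samples.
toy_embeddings = [
    [1.0, 0.0], [1.0, 0.001], [0.8, 0.3],
    [0.0, 1.0], [0.3, 0.8], [0.1, 0.9],
]
toy_labels = ["a", "a", "a", "b", "b", "a"]

duplicate_ranking = rank_near_duplicates(toy_embeddings)
error_scores = label_error_scores(toy_embeddings, toy_labels)
```

Sorting candidates by such scores gives a ranking that a human can inspect from the top. The scoring mode described in the abstract would instead place a threshold on the score distribution, which this sketch does not attempt.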

Paper Structure (54 sections, 7 equations, 25 figures, 14 tables)

Introduction
Related work
Methodology
Representation learning
Distance-based indicators
Operation modes
Experimental setup
Results
Synthetic contamination
Natural contamination
Influence of representation learning
Discussion
Conclusion and outlook
Appendix
Broader impact
...and 39 more sections

Figures (25)

Figure 1: SelfClean first trains a self-supervised encoder on noisy data to obtain latent representations for dataset samples. It then detects off-topic samples with agglomerative clustering, near duplicates based on pairwise distances, and label errors using the intra-/extra-class distance ratio.
Figure 2: Illustration of synthetic data quality issues of all three types in STL-10, VinDR, and DDI.
Figure 3: Performance of the best two approaches for each issue type compared to SelfClean across different representations for a mixed-contamination strategy at varying contamination rates. Gray regions indicate random performance, with an AP equal to the respective contamination rate $C_S$.
Figure 4: Performance of SelfClean when changing the distance function and removing the $L_2$-normalization. The performance is measured in terms of AP for a mixed-contamination strategy when varying the contamination rate. The artificial dataset is created from DDI by adding off-topic samples (BLUR), then injecting augmented duplicates (ARTE), and finally changing labels at random (LBL). Shaded regions indicate random performance.
Figure 5: Performance of SelfClean during pre-training. The performance is measured in terms of AP for a 10% mixed-contamination strategy. The artificial dataset is created from DDI by adding off-topic samples (BLUR), then injecting augmented duplicates (ARTE), and finally changing labels at random (LBL). Shaded regions indicate random performance.
...and 20 more figures
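
The captions above report average precision (AP) and mark random performance at an AP equal to the contamination rate. The sketch below illustrates that relationship with a plain AP computation and a simulated random ranking. The function names, the dataset size, and the number of trials are illustrative assumptions, not values from the paper. For finite datasets the simulated baseline lies slightly above the contamination rate.

```python
import random


def average_precision(ranked_flags):
    """AP of a ranking, given True/False issue flags in ranked order."""
    hits = 0
    precision_sum = 0.0
    for rank, flag in enumerate(ranked_flags, start=1):
        if flag:
            hits += 1
            precision_sum += hits / rank  # precision at this hit
    return precision_sum / hits if hits else 0.0


def random_baseline_ap(n_samples, contamination, n_trials=200, seed=0):
    """Mean AP of uniformly random rankings at a given contamination rate."""
    rng = random.Random(seed)
    n_issues = round(n_samples * contamination)
    flags = [True] * n_issues + [False] * (n_samples - n_issues)
    total = 0.0
    for _ in range(n_trials):
        rng.shuffle(flags)
        total += average_precision(flags)
    return total / n_trials


# Hypothetical setting: 1000 samples with 10% contamination.
baseline = random_baseline_ap(1000, 0.10)
```

A detector is useful for inspection only to the extent that its AP rises above this baseline, which is what the shaded regions in the figures make visible.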
