
Intrinsic Self-Supervision for Data Quality Audits

Fabian Gröger, Simone Lionetti, Philippe Gottfrois, Alvaro Gonzalez-Jimenez, Ludovic Amruthalingam, Labelling Consortium, Matthew Groh, Alexander A. Navarini, Marc Pouly

TL;DR

A specific combination of context-aware self-supervised representation learning and distance-based indicators effectively finds data quality issues, without annotation biases, in widely used image datasets, for both synthetic issues and real contamination.

Abstract

Benchmark datasets in computer vision often contain off-topic images, near duplicates, and label errors, leading to inaccurate estimates of model performance. In this paper, we revisit the task of data cleaning and formalize it as either a ranking problem, which significantly reduces human inspection effort, or a scoring problem, which allows for automated decisions based on score distributions. We find that a specific combination of context-aware self-supervised representation learning and distance-based indicators is effective in finding issues without annotation biases. This methodology, which we call SelfClean, surpasses state-of-the-art performance in detecting off-topic images, near duplicates, and label errors within widely-used image datasets, such as ImageNet-1k, Food-101N, and STL-10, both for synthetic issues and real contamination. We apply the detailed method to multiple image benchmarks, identify up to 16% of issues, and confirm an improvement in evaluation reliability upon cleaning. The official implementation can be found at: https://github.com/Digital-Dermatology/SelfClean.

Paper Structure (54 sections, 7 equations, 25 figures, 14 tables)


Figures (25)

  • Figure 1: SelfClean first trains a self-supervised encoder on noisy data to obtain latent representations for dataset samples. It then detects off-topic samples with agglomerative clustering, near duplicates based on pairwise distances, and label errors using the intra-/extra-class distance ratio.
  • Figure 2: Illustration of synthetic data quality issues of all three types in STL-10, VinDR, and DDI.
  • Figure 3: Performance of the best two approaches for each issue type compared to SelfClean, across different representations, for a mixed-contamination strategy at varying contamination rates. Gray regions indicate random performance, with an average precision (AP) equal to the respective contamination $C_S$.
  • Figure 4: Performance of SelfClean when changing the distance function and removing the $L_2$ normalization. The performance is measured in terms of AP for a mixed-contamination strategy when varying the contamination rate. The artificial dataset is created from DDI by adding off-topic samples (BLUR), then injecting augmented duplicates (ARTE), and finally changing labels at random (LBL). Shaded regions indicate random performance.
  • Figure 5: Performance of SelfClean during pre-training. The performance is measured in terms of AP for a 10% mixed-contamination strategy. The artificial dataset is created from DDI by adding off-topic samples (BLUR), then injecting augmented duplicates (ARTE), and finally changing labels at random (LBL). Shaded regions indicate random performance.
  • ...and 20 more figures
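
The two distance-based indicators named in the Figure 1 caption (pairwise distances for near duplicates, an intra-/extra-class distance ratio for label errors) can be illustrated with a minimal NumPy sketch. This is not the official SelfClean implementation. The function names are hypothetical, the embeddings are assumed to come from an already trained encoder, cosine distance on $L_2$-normalized vectors is one possible choice of distance, and the ratio of nearest same-class to nearest other-class distance is an assumed reading of the caption. Off-topic detection with agglomerative clustering is left out.

```python
import numpy as np


def cosine_distances(emb):
    """Pairwise cosine distances between L2-normalized rows of `emb`."""
    emb = np.asarray(emb, dtype=float)
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    z = emb / np.clip(norms, 1e-12, None)  # guard against zero vectors
    return np.clip(1.0 - z @ z.T, 0.0, 2.0)


def rank_near_duplicates(emb):
    """Return index pairs (i, j) with i < j, closest pairs first."""
    d = cosine_distances(emb)
    i, j = np.triu_indices(d.shape[0], k=1)  # each unordered pair once
    order = np.argsort(d[i, j], kind="stable")
    return [(int(a), int(b)) for a, b in zip(i[order], j[order])]


def rank_label_errors(emb, labels):
    """Return sample indices, most suspicious label first.

    Assumed score: distance to the nearest sample with the same label
    divided by the distance to the nearest sample with a different label.
    A large ratio means the sample sits closer to another class.
    """
    labels = np.asarray(labels)
    d = cosine_distances(emb)
    np.fill_diagonal(d, np.inf)  # a sample is not its own neighbour
    same = labels[:, None] == labels[None, :]
    intra = np.where(same, d, np.inf).min(axis=1)
    extra = np.where(~same, d, np.inf).min(axis=1)
    ratio = intra / (extra + 1e-12)
    return [int(k) for k in np.argsort(-ratio, kind="stable")]
```

Both functions return a full ranking rather than a thresholded decision, which matches the ranking formulation of the abstract: a human auditor would inspect candidates from the top of each list. Note that a mislabeled sample also raises the ratio of its correctly labeled neighbours, so the scores of nearby samples are not independent.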