
Dataset Diversity Metrics and Impact on Classification Models

Théo Sourget, Niclas Claßen, Jack Junchi Xu, Rob van der Goot, Veronika Cheplygina

Abstract

The diversity of training datasets is usually perceived as an important factor in obtaining a robust model. However, diversity is often left undefined or is defined differently across papers, and while some metrics exist, quantifying this diversity is often overlooked when developing new algorithms. In this work, we study the behaviour of multiple dataset diversity metrics for image, text and metadata using MorphoMNIST, a toy dataset with controlled perturbations, and PadChest, a publicly available chest X-ray dataset. We evaluate whether these metrics correlate with each other and with the intuition of a clinical expert. We also assess whether they correlate with downstream-task performance and how they impact the training dynamics of the models. We find limited correlations between the AUC and image or metadata reference-free diversity metrics, but higher correlations with the FID and the semantic diversity metrics. Finally, the clinical expert indicates that scanners are the main source of diversity in practice. However, we find that the addition of another scanner to the training set leads to shortcut learning. The code used in this study is available at https://github.com/TheoSourget/dataset_diversity_evaluation

Paper Structure (21 sections, 3 equations, 6 figures, 3 tables)


Figures (6)

  • Figure 1: Overview of our study. We assess dataset diversity measures across multiple modalities and compare their correlations, as well as their alignment with domain experts' knowledge gathered through interviews. We evaluate their impact on downstream-task performance and their effect on the training dynamics of subgroups.
  • Figure 2: Example of data from MorphoMNIST
  • Figure 3: Example of data from both scanners in the PadChest dataset.
  • Figure 4: Spearman's rank correlations between the metrics' rankings of the (a) MorphoMNIST and (b) chest X-ray scenarios.
  • Figure 5: Density plots of data maps for models trained with (a) plain images only, (b) plain and thin images, and (c) plain and swelling images for the MorphoMNIST dataset. A type of image is learnt faster if the density in the top-left part is higher, meaning that the probability assigned to the correct class for this type of image is higher and more stable during training.
  • ...and 1 more figures
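
To make the reading of the data maps in Figure 5 concrete, the following is a minimal sketch of how data-map coordinates are typically computed. It assumes the common data-map convention (confidence on the vertical axis, variability on the horizontal axis) and is not taken from the authors' repository. The function name `data_map_coordinates` and the probability values are hypothetical and purely illustrative.

```python
import numpy as np


def data_map_coordinates(true_class_probs):
    """Compute per-sample data-map coordinates.

    true_class_probs: array of shape (n_epochs, n_samples) holding, for each
    epoch, the probability the model assigns to the correct class of each
    training sample (hypothetical input format, assumed for this sketch).
    """
    probs = np.asarray(true_class_probs, dtype=float)
    confidence = probs.mean(axis=0)   # mean correct-class probability over epochs
    variability = probs.std(axis=0)   # spread of that probability over epochs
    return confidence, variability


# Illustrative values only: sample 0 is learnt quickly, sample 1 slowly.
probs = np.array([
    [0.90, 0.20],  # epoch 1
    [0.95, 0.60],  # epoch 2
    [0.97, 0.90],  # epoch 3
])
confidence, variability = data_map_coordinates(probs)
```

Under this convention, a sample with high confidence and low variability falls in the top-left region of the map, which is why a denser top-left region suggests that a type of image is learnt faster.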