Beyond Size and Class Balance: Alpha as a New Dataset Quality Metric for Deep Learning

Josiah Couch; Rima Arnaout; Ramy Arnaout

TL;DR

This paper introduces a comprehensive framework of diversity measures from ecology that generalizes familiar quantities like Shannon entropy by accounting for similarities among images, and proposes maximizing $A$ as a way to improve deep learning performance in medical imaging.

Abstract

In deep learning, achieving high performance on image classification tasks requires diverse training sets. However, the current best practice (maximizing dataset size and class balance) does not guarantee dataset diversity. We hypothesized that, for a given model architecture, model performance can be improved by maximizing diversity more directly. To test this hypothesis, we introduce a comprehensive framework of diversity measures from ecology that generalizes familiar quantities like Shannon entropy by accounting for similarities among images. (Size and class balance emerge as special cases.) Analyzing thousands of subsets from seven medical datasets showed that the best correlates of performance were not size or class balance but $A$ ("big alpha"), a set of generalized entropy measures interpreted as the effective number of image-class pairs in the dataset, after accounting for image similarities. One of these, $A_0$, explained 67% of the variance in balanced accuracy, vs. 54% for class balance and just 39% for size. The best pair of measures was size-plus-$A_1$ (79%), which outperformed size-plus-class-balance (74%). Subsets with the largest $A_0$ performed up to 16% better than those with the largest size (median improvement, 8%). We propose maximizing $A$ as a way to improve deep learning performance in medical imaging.
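For orientation, the standard definitions that this kind of framework builds on can be sketched as follows. These are the frequency-only Hill number and its similarity-sensitive generalization due to Leinster and Cobbold. The symbols $p_i$ (relative frequency of element $i$), $Z$ (pairwise similarity matrix with $Z_{ii}=1$), and $q$ (order) are notation assumed here, and the paper's $A$ is a class-aware variant whose exact definition is not reproduced in this summary. The subscripts in $A_0$ and $A_1$ appear to play the role of the order $q$.

$$
D_q(p) = \Big(\sum_{i:\,p_i>0} p_i^{\,q}\Big)^{\frac{1}{1-q}}, \qquad \lim_{q\to 1} D_q(p) = \exp\Big(-\sum_{i:\,p_i>0} p_i \ln p_i\Big)
$$

$$
D_q^{Z}(p) = \Big(\sum_{i:\,p_i>0} p_i\,\big[(Zp)_i\big]^{\,q-1}\Big)^{\frac{1}{1-q}}, \qquad (Zp)_i = \sum_j Z_{ij}\,p_j
$$

When $Z$ is the identity matrix, $(Zp)_i = p_i$ and $D_q^{Z}$ reduces to $D_q$. At $q=0$ with $Z=I$ the measure simply counts the distinct elements present, and at $q=1$ it is the exponential of the Shannon entropy, which is one way to see how size-like and balance-like quantities can arise as special cases. For $n$ equally frequent, fully distinct elements, $D_q = n$ for every $q$, which motivates the "effective number" reading.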

Paper Structure (23 sections, 2 equations, 5 figures, 3 tables)

This paper contains 23 sections, 2 equations, 5 figures, 3 tables.

Introduction
Relationship to prior work
Background
Dataset diversity as a function of element frequencies and similarities
Hill's D-number framework and element frequencies
The LCR framework and element similarities
Dataset size and class balance as special cases of LCR
Class-level and dataset-level diversities
Interpretation
Methods
Selecting datasets
Creating dataset subsets
Model training and performance/quality assessment
Definition of pairwise image similarity and measurement of diversity features
Quality indicators/regression features
...and 8 more sections

Figures (5)

Figure 1: Sensible definitions of diversity must be sensitive to both frequency and similarity. Four same-sized datasets of 10 images each from the MNIST handwritten digits dataset [mnist] are shown. The two datasets in (a) contain the same five unique images, differing only in their relative frequencies; the more balanced dataset is intuitively more diverse. The two datasets in (b) each contain 10 unique images but differ in how similar the images are to each other; the dataset with the more-different images is intuitively more diverse. See also, e.g., [jost2007greylock].
Figure 2: Interpreting the diversity of a dataset by visualizing soft clusters within the similarity matrix. Right: four randomly chosen images from different regions of the clustered heatmap, one from each of the four classes.
Figure 3: (a) Class balance vs. subset size, (b) balanced accuracy (BACC) vs. subset size, and (c) BACC vs. $A_0$ (normalized by number of classes) for each dataset. Boxes span the first and third quartiles; whiskers extend up to a further 1.5 times the interquartile range (the matplotlib.pyplot.boxplot default settings). Numbers in (b) and (c) indicate the median BACC for the top bin.
Figure 4: Regression results for (a) top 10 single measures, (b) top 10 pairs of measures, (c) top 10 sets of three measures, and (d) other sets of interest.
Figure 5: $R^2$ among diversity measures (clustered). Lines delimit highly correlated clusters.
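
The intuition in Figure 1 can be checked numerically with a small sketch. The code below implements the Leinster-Cobbold similarity-sensitive diversity of order `q`, not the paper's $A$ itself, whose class-aware definition is not given in this summary. The function name, the toy frequency vectors, and the off-diagonal similarity of 0.9 are all assumptions made up for illustration; no real image similarities are used.

```python
import numpy as np

def similarity_sensitive_diversity(p, Z, q):
    """Leinster-Cobbold diversity of order q (illustrative helper, not the paper's A).

    p: nonnegative counts or frequencies, one per unique element
    Z: similarity matrix with ones on the diagonal and entries in [0, 1]
    q: order; q=0 weights rare elements most, larger q weights common ones
    """
    p = np.asarray(p, dtype=float)
    Z = np.asarray(Z, dtype=float)
    p = p / p.sum()                    # normalize to relative frequencies
    present = p > 0                    # absent elements contribute nothing
    Zp = (Z @ p)[present]              # "ordinariness" of each present element
    p = p[present]
    if np.isclose(q, 1.0):
        # q -> 1 limit: exponential of a similarity-aware Shannon entropy
        return float(np.exp(-np.sum(p * np.log(Zp))))
    return float(np.sum(p * Zp ** (q - 1.0)) ** (1.0 / (1.0 - q)))

# Figure 1(a) analogue: same five unique images, 10 images total,
# treated as fully distinct (identity similarity matrix).
identity5 = np.eye(5)
balanced = similarity_sensitive_diversity([2, 2, 2, 2, 2], identity5, q=1)
skewed = similarity_sensitive_diversity([6, 1, 1, 1, 1], identity5, q=1)

# Figure 1(b) analogue: ten unique images, equal frequencies,
# differing only in how similar the images are to one another.
uniform10 = np.ones(10)
distinct = similarity_sensitive_diversity(uniform10, np.eye(10), q=0)
Z_similar = np.full((10, 10), 0.9)     # assumed toy similarity between images
np.fill_diagonal(Z_similar, 1.0)
similar = similarity_sensitive_diversity(uniform10, Z_similar, q=0)

print(f"(a) balanced={balanced:.3f}  skewed={skewed:.3f}")
print(f"(b) distinct={distinct:.3f}  similar={similar:.3f}")
```

In this toy setup the balanced set has a higher effective number than the skewed one, and the set of mutually similar images has a much lower effective number than the set of distinct images, even though all four sets contain 10 images.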
