Dataset Difficulty and the Role of Inductive Bias

Devin Kwok, Nikhil Anand, Jonathan Frankle, Gintare Karolina Dziugaite, David Rolnick

TL;DR

This work systematically evaluates example difficulty scores by decomposing variability into variance, covariance, and bias across random initializations, scoring methods, and architectural inductive biases. Using CIFAR-10 with ResNet-20 and model-variant experiments, it reveals substantial per-run variance, high cross-score correlations, and a single dominant direction of shared difficulty. It also shows that a small set of highly sensitive examples can fingerprint inductive biases, enabling architecture classification with simple models, while cautioning that rankings can be unstable when averaging over few runs. The findings establish baselines for evaluating difficulty scores and provide practical guidance on score selection, run budgets, and cross-architecture comparisons.

Abstract

Motivated by the goals of dataset pruning and defect identification, a growing body of methods has been developed to score individual examples within a dataset. These methods, which we call "example difficulty scores", are typically used to rank or categorize examples, but the consistency of rankings between different training runs, scoring methods, and model architectures is generally unknown. To determine how example rankings vary due to these random and controlled effects, we systematically compare different formulations of scores over a range of runs and model architectures. We find that scores largely share the following traits: they are noisy over individual runs of a model, strongly correlated with a single notion of difficulty, and reveal examples that range from being highly sensitive to insensitive to the inductive biases of certain model architectures. Drawing from statistical genetics, we develop a simple method for fingerprinting model architectures using a few sensitive examples. These findings guide practitioners in maximizing the consistency of their scores (e.g., by choosing appropriate scoring methods, number of runs, and subsets of examples), and establish comprehensive baselines for evaluating scores in the future.
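
The fingerprinting idea mentioned in the abstract can be illustrated with a minimal sketch on synthetic data. It is not the authors' implementation: the data generator, the function names (`welch_t`, `most_sensitive`, `fit_logistic`, `simulate`), and the use of the absolute Welch t-statistic as a stand-in for the paper's significance-based ranking are all assumptions made for illustration. The sketch picks the examples whose scores differ most between two hypothetical architectures, then trains a small logistic regression on those few examples to tell the architectures apart.

```python
import math
import random
from statistics import mean, variance


def welch_t(xs, ys):
    """Welch's t-statistic for two samples of per-run scores."""
    se = math.sqrt(variance(xs) / len(xs) + variance(ys) / len(ys))
    return (mean(xs) - mean(ys)) / se


def most_sensitive(runs_a, runs_b, k):
    """Indices of the k examples whose scores differ most between two models.

    Assumption: |t| is used here as a proxy for statistical significance.
    """
    n = len(runs_a[0])
    t = [abs(welch_t([r[i] for r in runs_a], [r[i] for r in runs_b]))
         for i in range(n)]
    return sorted(range(n), key=lambda i: -t[i])[:k]


def fit_logistic(X, y, lr=0.5, steps=500):
    """Plain gradient-descent logistic regression; returns (weights, bias)."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(steps):
        gw, gb = [0.0] * len(w), 0.0
        for x, t in zip(X, y):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1 / (1 + math.exp(-z))
            for j, xj in enumerate(x):
                gw[j] += (p - t) * xj
            gb += p - t
        w = [wi - lr * g / len(X) for wi, g in zip(w, gw)]
        b -= lr * gb / len(X)
    return w, b


def predict(w, b, x):
    """Predict 1 if the linear score is positive, else 0."""
    return int(sum(wi * xi for wi, xi in zip(w, x)) + b > 0)


def simulate(n_runs, shift, rng, n_examples=100, n_sensitive=8):
    """Synthetic scores: only the first n_sensitive examples respond to the
    (hypothetical) architecture change, via a mean shift."""
    return [[rng.gauss(shift if i < n_sensitive else 0.0, 0.3)
             for i in range(n_examples)]
            for _ in range(n_runs)]


def select(runs, idx):
    """Keep only the chosen example columns as features."""
    return [[r[i] for i in idx] for r in runs]


rng = random.Random(0)
train_a, train_b = simulate(20, 0.0, rng), simulate(20, 1.0, rng)
test_a, test_b = simulate(20, 0.0, rng), simulate(20, 1.0, rng)

top = most_sensitive(train_a, train_b, k=8)
labels = [0] * 20 + [1] * 20
w, b = fit_logistic(select(train_a, top) + select(train_b, top), labels)
preds = [predict(w, b, x) for x in select(test_a, top) + select(test_b, top)]
accuracy = sum(p == t for p, t in zip(preds, labels)) / len(labels)
print(sorted(top), accuracy)
```

In this toy setting the sensitive examples are planted by construction, so the classifier's job is far easier than on real difficulty scores; the sketch only shows the shape of the pipeline (rank examples by sensitivity, keep a few, classify).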

Paper Structure (12 sections, 6 figures)

Figures (6)

  • Figure 1: Left: Top 8 CIFAR-10 examples most/least sensitive to inductive biases (model width, depth, and architecture), ranked by statistical significance of changes to difficulty score for pairwise model comparisons (see \ref{sec:bias} for more details). Numbers indicate mean $-\log P$ value of each example. Right: Distinguishing between VGG and ResNet using the top 8 examples (Left) as features for logistic regression.
  • Figure 2: Variance of selected difficulty scores for CIFAR-10 examples. Rows 1-2: Score distribution for individual examples (y-axis) and individual runs, ordered by the rank of their mean score (x-axis). To make the plots more legible, examples are binned into groups of 1000. Shaded region indicates 90% confidence interval. From top to bottom and left to right: mean loss over training, mean accuracy over training, area under margin (correct class probability minus top other class) over training, number of forgetting events, epoch when example is learned without further forgetting, gradient norm at epoch 20, $L^2$ norm of output probability error at epoch 20, mean per-pixel variance of gradients at input over training.
  • Figure 3: Variability in example difficulty scores due to sample size. Left and Center: Variability in ranking examples by their mean difficulty scores. Scores are averaged over a fixed number of runs (x-axis), and the median (Left) and 95% confidence interval (Center) of absolute rank change between two such sets of runs is shown (y-axis). Right: variability in 50% data splits found by thresholding mean difficulty scores. Number of runs that scores are averaged over (x-axis) is plotted against the fraction of examples that change classification between two such sets of runs (y-axis).
  • Figure 4: Spearman (rank) correlation between difficulty scores. To ensure positive correlations, scores are multiplied by $-1$ where required. Left: correlations between mean difficulty scores. All scores other than "Ensemble agreement" and "Holdout retraining" are averaged over 40 runs; the latter are pre-computed in \cite{carlini2019distribution}. Center: difference in correlation when correlating mean scores (rows) with individual runs (columns), i.e. $\mathbb{E}_{Y}[Corr(\mathbb{E}_{X}[X], Y)] - Corr(\mathbb{E}_{X}[X], \mathbb{E}_Y[Y])$. Note that individual runs are not available for "Ensemble agreement" and "Holdout retraining" scores. Right: difference in correlation when correlating scores within individual runs, i.e. $\mathbb{E}_{X,Y}[Corr(X, Y)] - Corr(\mathbb{E}_{X}[X], Y)$.
  • Figure 5: Principal component analysis of mean difficulty scores. All scores are first transformed to the normal distribution via quantile transform. Left: linear coefficients for top-6 principal components accounting for $>98\%$ of variation (heatmap top), and ratio of explained variance per component (stacked bars bottom). Right: variation in data splits by 50% threshold, median absolute rank error, and 95% confidence interval for ranks when using the top principal component as a score. This plot is analogous to Figure 3.
  • ...and 1 more figure
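
The rank-stability measurements described in the Figure 3 caption can be sketched as follows. This is an illustrative toy on synthetic scores, not the paper's code: the noise model and the helper names (`ranks`, `mean_scores`, `rank_stability`, `simulate_runs`) are assumptions. For two disjoint sets of runs, it averages scores per example, ranks the examples, and reports the median absolute rank change and the fraction of examples that switch sides of a 50% split.

```python
import random
from statistics import mean, median


def ranks(values):
    """Return the 0-based rank of each value (ties broken by index)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0] * len(values)
    for r, i in enumerate(order):
        out[i] = r
    return out


def mean_scores(runs):
    """Average per-example scores over a list of runs."""
    return [mean(col) for col in zip(*runs)]


def rank_stability(runs_a, runs_b):
    """Compare rankings obtained from two disjoint sets of runs.

    Returns (median absolute rank change,
             fraction of examples whose side of the 50% split changes).
    """
    ra, rb = ranks(mean_scores(runs_a)), ranks(mean_scores(runs_b))
    n = len(ra)
    med = median(abs(a - b) for a, b in zip(ra, rb))
    flipped = sum((a < n // 2) != (b < n // 2) for a, b in zip(ra, rb)) / n
    return med, flipped


def simulate_runs(difficulty, n_runs, noise, rng):
    """Synthetic runs: a fixed per-example difficulty plus Gaussian run noise."""
    return [[d + rng.gauss(0, noise) for d in difficulty]
            for _ in range(n_runs)]


rng = random.Random(0)
difficulty = [rng.random() for _ in range(200)]
for k in (1, 5, 20):
    a = simulate_runs(difficulty, k, 0.5, rng)
    b = simulate_runs(difficulty, k, 0.5, rng)
    print(k, rank_stability(a, b))
```

Under this noise model, averaging over more runs shrinks both quantities, which mirrors the caption's x-axis (number of runs averaged) against its y-axes (rank change and split disagreement).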