The Effect of Intrinsic Dataset Properties on Generalization: Unraveling Learning Differences Between Natural and Medical Images

Nicholas Konz, Maciej A. Mazurowski

TL;DR

This work addresses why the scaling of generalization error with intrinsic dataset properties differs across imaging domains. It derives and validates a generalization scaling law, $L = \mathcal{O}(K_{\mathcal{F}} N^{-1/d_{\mathrm{data}}})$, and introduces dataset label sharpness $K_{\mathcal{F}}$ as a domain-sensitive factor, with medical images typically exhibiting higher $K_{\mathcal{F}}$ than natural images. The authors also connect robustness to label sharpness, showing that adversarial robustness scales as $1/K_{\mathcal{F}}$, so that vulnerability grows with $K_{\mathcal{F}}$, and extend the analysis to learned representations, establishing that $d_{\mathrm{repr}}$ is upper-bounded by $d_{\mathrm{data}}$ under reasonable Lipschitz assumptions. Across six models and eleven datasets, the results show how intrinsic dataset properties govern generalization, representation learning, and robustness, with practical implications for medical imaging, where annotation cost and attack resilience are critical. Code is publicly available.

Abstract

This paper investigates discrepancies in how neural networks learn from different imaging domains, which are commonly overlooked when adopting computer vision techniques from the domain of natural images to other specialized domains such as medical images. Recent works have found that the generalization error of a trained network typically increases with the intrinsic dimension ($d_{\mathrm{data}}$) of its training set. Yet, the steepness of this relationship varies significantly between medical (radiological) and natural imaging domains, with no existing theoretical explanation. We address this gap in knowledge by establishing and empirically validating a generalization scaling law with respect to $d_{\mathrm{data}}$, and propose that the substantial scaling discrepancy between the two considered domains may be at least partially attributed to the higher intrinsic "label sharpness" ($K_\mathcal{F}$) of medical imaging datasets, a metric which we propose. Next, we demonstrate an additional benefit of measuring the label sharpness of a training set: it is negatively correlated with the trained model's adversarial robustness, which notably leads to models for medical images having a substantially higher vulnerability to adversarial attack. Finally, we extend our $d_{\mathrm{data}}$ formalism to the related metric of learned representation intrinsic dimension ($d_{\mathrm{repr}}$), derive a generalization scaling law with respect to $d_{\mathrm{repr}}$, and show that $d_{\mathrm{data}}$ serves as an upper bound for $d_{\mathrm{repr}}$. Our theoretical results are supported by thorough experiments with six models and eleven natural and medical imaging datasets over a range of training set sizes. Our findings offer insights into the influence of intrinsic dataset properties on generalization, representation learning, and robustness in deep neural networks. Code link: https://github.com/mazurowski-lab/intrinsic-properties
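
To make the idea of label sharpness concrete, the following is a minimal sketch that treats $K_\mathcal{F}$ as an empirical Lipschitz constant of the labeling function: the largest ratio of label difference to input distance over all pairs of samples. This form of the estimator is an assumption made for illustration, since the abstract does not spell out the estimator; the function name `label_sharpness` and the toy two-pixel "images" are hypothetical.

```python
import itertools
import math


def label_sharpness(xs, ys):
    """Empirical Lipschitz estimate of the labeling function (assumed form):
    max over sample pairs of |y_j - y_k| / ||x_j - x_k||."""
    best = 0.0
    for (xj, yj), (xk, yk) in itertools.combinations(zip(xs, ys), 2):
        dist = math.dist(xj, xk)
        if dist == 0.0:
            continue  # identical inputs give no finite ratio; skip them
        best = max(best, abs(yj - yk) / dist)
    return best


# Toy data: flattened two-pixel "images" with binary labels (hypothetical).
# In the first set the classes are far apart; in the second, samples with
# different labels sit very close together, so the labels change sharply.
smooth_x = [(0.0, 0.0), (0.1, 0.0), (0.9, 1.0), (1.0, 1.0)]
smooth_y = [0, 0, 1, 1]
sharp_x = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0), (1.0, 1.0)]
sharp_y = [0, 1, 0, 1]

k_smooth = label_sharpness(smooth_x, smooth_y)
k_sharp = label_sharpness(sharp_x, sharp_y)
print(f"smooth: {k_smooth:.3f}, sharp: {k_sharp:.3f}")
```

Under this assumed definition, a dataset in which visually similar images carry different labels yields a larger estimate, which is the sense in which the abstract describes medical imaging datasets as having higher label sharpness.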

Paper Structure

This paper contains 42 sections, 5 theorems, 23 equations, 27 figures, 7 tables.

Key Result

Theorem 1

Let $L$, $f$ and $\mathcal{F}$ be Lipschitz on $\mathcal{M}_{d_\mathrm{data}}$ with respective constants $K_L$, $K_f$ and $K_\mathcal{F}$. Further let $\mathcal{D}_\mathrm{train}$ be a training set of size $N$ sampled i.i.d. from $\mathcal{M}_{d_\mathrm{data}}$, with $f(x)=\mathcal{F}(x)$ for all $x$ …
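
The scaling law in the form given in the TL;DR, $L = \mathcal{O}(K_{\mathcal{F}} N^{-1/d_{\mathrm{data}}})$, can be explored numerically. The sketch below evaluates only the right-hand side of that bound, with the hidden constant set to a hypothetical value `c = 1.0`; the function names are illustrative and the numbers are arithmetic consequences of the formula, not results from the paper.

```python
def bound(n, d_data, k_f, c=1.0):
    """Right-hand side of L = O(K_F * N^(-1/d_data)).
    The constant c is hypothetical; the O(.) notation hides it."""
    return c * k_f * n ** (-1.0 / d_data)


def sample_growth_factor(reduction, d_data):
    """Factor by which N must grow to shrink the bound by `reduction`,
    obtained by solving (N' / N)^(-1/d_data) = 1 / reduction."""
    return reduction ** d_data


# Halving the bound at intrinsic dimension 10 needs 2^10 times more data.
print(sample_growth_factor(2, 10))   # 1024
# At intrinsic dimension 20 the same halving needs 2^20 times more data.
print(sample_growth_factor(2, 20))   # 1048576

# At fixed N and d_data, the bound is linear in label sharpness K_F.
print(bound(1000, 10, k_f=2.0) / bound(1000, 10, k_f=1.0))
```

The exponent $-1/d_{\mathrm{data}}$ is why the bound loosens as intrinsic dimension rises at fixed $N$, while the $K_{\mathcal{F}}$ prefactor rescales the whole curve.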

Figures (27)

  • Figure 1: Measured intrinsic dimension ($d_\mathrm{data}$, left) and label sharpness ($\hat{K}_\mathcal{F}$, right) of the natural (orange) and medical (blue) image datasets which we analyze (Sec. \ref{sec:data}). $\hat{K}_\mathcal{F}$ is typically higher for the medical datasets. $d_\mathrm{data}$ values are averaged over all training set sizes, and $\hat{K}_\mathcal{F}$ over all class pairings (Sec. \ref{sec:KFest}); error bars indicate $95\%$ confidence intervals.
  • Figure 2: Scaling of log test set loss/generalization ability with training dataset intrinsic dimension ($d_\mathrm{data}$) for natural and medical datasets. Each point corresponds to a (model, dataset, training set size) triplet. Medical dataset results are shown in blue shades, and natural dataset results are shown in red; note the difference in generalization error scaling rate between the two imaging domains. Standard deviation error bars are shown for natural image datasets for 5 different class pairs.
  • Figure 3: Test set loss penalty due to FGSM adversarial attack vs. measured dataset label sharpness ($\hat{K}_\mathcal{F}$) for models trained on natural and medical image datasets (orange and blue points, respectively). Pearson correlation coefficient $r$ also shown. Error bars are $95\%$ confidence intervals over all training set sizes $N$ for the same dataset.
  • Figure 4: Scaling of log test set loss/generalization ability with the intrinsic dimension of final hidden layer learned representations of the training set ($d_\mathrm{repr}$), for natural and medical datasets. Each point corresponds to a (model, dataset, training set size) triplet. Medical dataset results are shown in blue shades, and natural dataset results are shown in red.
  • Figure 5: Training set intrinsic dimension upper-bounds learned representation intrinsic dimension. Each point corresponds to a (model, dataset, training set size) triplet.
  • ...and 22 more figures
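
The quantity plotted in Figure 3, the test loss penalty due to an FGSM attack, can be illustrated with a minimal sketch. FGSM perturbs an input by a step of size `eps` in the direction of the sign of the loss gradient with respect to that input. The paper's models are deep networks; a two-weight logistic regression is swapped in here purely as a stand-in so that the gradient can be written by hand, and the weights, input, and `eps` are hypothetical values.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce_loss(w, b, x, y):
    """Binary cross-entropy of a logistic-regression model on one sample."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def fgsm(w, b, x, y, eps):
    """FGSM step: x_adv = x + eps * sign(d loss / d x).
    For logistic regression with BCE, d loss / d x = (p - y) * w."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    grad = [(p - y) * wi for wi in w]
    return [xi + eps * ((g > 0) - (g < 0)) for xi, g in zip(x, grad)]


# Hypothetical stand-in model and a single correctly classified sample.
w, b = [2.0, -1.0], 0.0
x, y = [0.5, 0.2], 1

x_adv = fgsm(w, b, x, y, eps=0.1)
# Loss penalty: loss on the attacked input minus loss on the clean input.
penalty = bce_loss(w, b, x_adv, y) - bce_loss(w, b, x, y)
print(x_adv, penalty)
```

A penalty computed this way and averaged over a test set gives one number per trained model, which is the kind of value Figure 3 plots against $\hat{K}_\mathcal{F}$.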

Theorems & Definitions (11)

  • Theorem 1: Generalization Error and Dataset Intrinsic Dim. Scaling Law (Bahri et al., 2021)
  • Theorem 2: Approximating $K_f$ with $K_\mathcal{F}$
  • Theorem 3: Adversarial Robustness and Label Sharpness Scaling Law
  • proof
  • Theorem 4: Generalization Error and Learned Representation Intrinsic Dimension Scaling Law
  • Theorem 5: Bounding of Representation Intrinsic Dim. with Dataset Intrinsic Dim.
  • proof
  • proof
  • proof
  • proof
  • ...and 1 more