The Effect of Intrinsic Dataset Properties on Generalization: Unraveling Learning Differences Between Natural and Medical Images

Nicholas Konz; Maciej A. Mazurowski

TL;DR

This work addresses why the scaling of generalization error with intrinsic dataset properties differs across imaging domains. It derives and validates a generalization scaling law, $L = \mathcal{O}(K_{\mathcal{F}} N^{-1/d_{\mathrm{data}}})$, where $N$ is the training set size and $d_{\mathrm{data}}$ is the intrinsic dimension of the training set, and introduces dataset label sharpness $K_{\mathcal{F}}$ as a domain-sensitive factor, with medical images typically exhibiting higher $K_{\mathcal{F}}$ than natural images. The authors also connect robustness to label sharpness, showing that adversarial robustness decreases as $K_{\mathcal{F}}$ increases, which leaves models for medical images more vulnerable to attack. They then extend the analysis to learned representations, showing that the representation intrinsic dimension $d_{\mathrm{repr}}$ is upper-bounded by $d_{\mathrm{data}}$ under Lipschitz assumptions. Across six models and eleven datasets, the results show how intrinsic dataset properties govern generalization, representation learning, and robustness, with practical implications for medical imaging, where annotation cost and attack resilience are critical. Code is publicly available.

Abstract

This paper investigates discrepancies in how neural networks learn from different imaging domains, which are commonly overlooked when adopting computer vision techniques from the domain of natural images to other specialized domains such as medical images. Recent works have found that the generalization error of a trained network typically increases with the intrinsic dimension ($d_{data}$) of its training set. Yet, the steepness of this relationship varies significantly between medical (radiological) and natural imaging domains, with no existing theoretical explanation. We address this gap in knowledge by establishing and empirically validating a generalization scaling law with respect to $d_{data}$, and propose that the substantial scaling discrepancy between the two considered domains may be at least partially attributed to the higher intrinsic "label sharpness" ($K_\mathcal{F}$) of medical imaging datasets, a metric which we propose. Next, we demonstrate an additional benefit of measuring the label sharpness of a training set: it is negatively correlated with the trained model's adversarial robustness, which notably leads to models for medical images having a substantially higher vulnerability to adversarial attack. Finally, we extend our $d_{data}$ formalism to the related metric of learned representation intrinsic dimension ($d_{repr}$), derive a generalization scaling law with respect to $d_{repr}$, and show that $d_{data}$ serves as an upper bound for $d_{repr}$. Our theoretical results are supported by thorough experiments with six models and eleven natural and medical imaging datasets over a range of training set sizes. Our findings offer insights into the influence of intrinsic dataset properties on generalization, representation learning, and robustness in deep neural networks. Code link: https://github.com/mazurowski-lab/intrinsic-properties
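As a rough illustration of the scaling law with respect to $d_{data}$, the sketch below evaluates $K_\mathcal{F} N^{-1/d_{data}}$ with the unknown constant factor dropped. The function name and the numeric inputs are hypothetical and chosen only to show the trend; they are not values from the paper.

```python
def scaling_bound(k_f: float, n: int, d_data: float) -> float:
    """Return K_F * N^(-1/d_data), i.e. the bound up to an unspecified constant."""
    if n <= 0 or d_data <= 0:
        raise ValueError("n and d_data must be positive")
    return k_f * n ** (-1.0 / d_data)


if __name__ == "__main__":
    # Hypothetical values: two label-sharpness levels, three intrinsic dimensions.
    for k_f in (1.0, 2.0):
        for d_data in (5.0, 10.0, 20.0):
            bound = scaling_bound(k_f, n=1000, d_data=d_data)
            print(f"K_F={k_f:.1f}  d_data={d_data:4.1f}  bound={bound:.4f}")
```

Under this form, doubling $K_\mathcal{F}$ doubles the bound at fixed $N$ and $d_{data}$, while a larger $d_{data}$ makes the bound decay more slowly as $N$ grows.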

Paper Structure (42 sections, 5 theorems, 23 equations, 27 figures, 7 tables)

Introduction
Related Works
Preliminaries
Estimating Dataset Intrinsic Dimension
Estimating Dataset Label Sharpness
Datasets, Models and Training
Medical Image Datasets.
Natural Image Datasets.
Models and training.
The Relationship of Generalization with Dataset Intrinsic Dimension and Label Sharpness
Bounding generalization ability with dataset intrinsic dimension
Generalization Discrepancies Between Imaging Domains
Adversarial Robustness and Training Set Label Sharpness
Connecting Representation Intrinsic Dimension to Dataset Intrinsic Dimension and Generalization
Supplementary Materials
...and 27 more sections
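The section list above includes a step for estimating dataset intrinsic dimension, but this excerpt does not say which estimator is used. As an assumed stand-in, the sketch below implements the Levina-Bickel maximum-likelihood estimator based on nearest-neighbor distances. The function name, the default `k`, and the synthetic data are illustrative choices, not the authors' configuration.

```python
import numpy as np


def mle_intrinsic_dim(x: np.ndarray, k: int = 20) -> float:
    """Levina-Bickel MLE of intrinsic dimension, averaged over all points.

    Assumes the rows of x are distinct points (duplicates give zero distances).
    """
    x = np.asarray(x, dtype=np.float64).reshape(len(x), -1)
    n = x.shape[0]
    if not 2 <= k < n:
        raise ValueError("k must satisfy 2 <= k < number of points")
    # Pairwise Euclidean distances via the Gram-matrix identity.
    sq = np.sum(x ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (x @ x.T)
    dist = np.sqrt(np.maximum(d2, 0.0))
    np.fill_diagonal(dist, 0.0)
    # After sorting each row, column 0 is the point itself; keep neighbors 1..k.
    knn = np.sort(dist, axis=1)[:, 1:k + 1]
    # Per-point estimate: (k - 1) / sum_{j<k} log(T_k / T_j).
    log_ratio = np.log(knn[:, -1:] / knn[:, :-1])
    per_point = (k - 1) / np.sum(log_ratio, axis=1)
    return float(np.mean(per_point))


if __name__ == "__main__":
    # Synthetic check: a 2-D square embedded in 10-D ambient space.
    rng = np.random.default_rng(0)
    points = np.zeros((600, 10))
    points[:, :2] = rng.uniform(size=(600, 2))
    print("estimated intrinsic dimension:", mle_intrinsic_dim(points))
```

The estimate depends on `k` and on sample size, so in practice it is usually reported with the neighborhood size that was used.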

Key Result

Theorem 1

Let $L$, $f$ and $\mathcal{F}$ be Lipschitz on $\mathcal{M}_{d_\mathrm{data}}$ with respective constants $K_L$, $K_f$ and $K_\mathcal{F}$. Further let $\mathcal{D}_\mathrm{train}$ be a training set of size $N$ sampled i.i.d. from $\mathcal{M}_{d_\mathrm{data}}$, with $f(x)=\mathcal{F}(x)$ for all $x \dots$

Figures (27)

Figure 1: Measured intrinsic dimension ($d_\mathrm{data}$, left) and label sharpness ($\hat{K}_\mathcal{F}$, right) of the natural (orange) and medical (blue) image datasets which we analyze (Sec. \ref{sec:data}). $\hat{K}_\mathcal{F}$ is typically higher for the medical datasets. $d_\mathrm{data}$ values are averaged over all training set sizes, and $\hat{K}_\mathcal{F}$ over all class pairings (Sec. \ref{sec:KFest}); error bars indicate $95\%$ confidence intervals.
Figure 2: Scaling of log test set loss/generalization ability with training dataset intrinsic dimension ($d_\mathrm{data}$) for natural and medical datasets. Each point corresponds to a (model, dataset, training set size) triplet. Medical dataset results are shown in blue shades, and natural dataset results are shown in red; note the difference in generalization error scaling rate between the two imaging domains. Standard deviation error bars are shown for natural image datasets for 5 different class pairs.
Figure 3: Test set loss penalty due to FGSM adversarial attack vs. measured dataset label sharpness ($\hat{K}_\mathcal{F}$) for models trained on natural and medical image datasets (orange and blue points, respectively). Pearson correlation coefficient $r$ also shown. Error bars are $95\%$ confidence intervals over all training set sizes $N$ for the same dataset.
Figure 4: Scaling of log test set loss/generalization ability with the intrinsic dimension of final hidden layer learned representations of the training set ($d_\mathrm{repr}$), for natural and medical datasets. Each point corresponds to a (model, dataset, training set size) triplet. Medical dataset results are shown in blue shades, and natural dataset results are shown in red.
Figure 5: Training set intrinsic dimension upper-bounds learned representation intrinsic dimension. Each point corresponds to a (model, dataset, training set size) triplet.
...and 22 more figures
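Figure 1 reports a measured label sharpness $\hat{K}_\mathcal{F}$, an empirical counterpart to the Lipschitz constant $K_\mathcal{F}$ of the labeling function. This excerpt does not give the exact estimator, so the sketch below assumes a common empirical proxy: the largest ratio of label difference to input distance over randomly sampled pairs. The function name, the number of pairs, and the toy data are hypothetical.

```python
import numpy as np


def label_sharpness(x, y, num_pairs: int = 1000, seed: int = 0) -> float:
    """Empirical Lipschitz-style estimate: max |y_i - y_j| / ||x_i - x_j||.

    Assumption: pairs are sampled uniformly with replacement, and pairs at
    zero distance (including a point paired with itself) are skipped.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=np.float64).reshape(len(x), -1)
    y = np.asarray(y, dtype=np.float64)
    i = rng.integers(0, len(x), size=num_pairs)
    j = rng.integers(0, len(x), size=num_pairs)
    dist = np.linalg.norm(x[i] - x[j], axis=1)
    valid = dist > 0
    if not np.any(valid):
        raise ValueError("no sampled pair has non-zero distance")
    ratios = np.abs(y[i] - y[j])[valid] / dist[valid]
    return float(np.max(ratios))


if __name__ == "__main__":
    # Toy binary-labeled data: three 1-D "images".
    x = np.array([[0.0], [1.0], [3.0]])
    y = np.array([0, 1, 1])
    print("estimated label sharpness:", label_sharpness(x, y))
```

A higher value means that differently labeled inputs can lie close together in input space, which is the sense in which medical datasets are described as having sharper labels.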

Theorems & Definitions (11)

Theorem 1: Generalization Error and Dataset Intrinsic Dim. Scaling Law \cite{bahri2021explaining}
Theorem 2: Approximating $K_f$ with $K_\mathcal{F}$
Theorem 3: Adversarial Robustness and Label Sharpness Scaling Law
proof
Theorem 4: Generalization Error and Learned Representation Intrinsic Dimension Scaling Law
Theorem 5: Bounding of Representation Intrinsic Dim. with Dataset Intrinsic Dim.
proof
proof
proof
proof
...and 1 more
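Theorem 3 and Figure 3 relate label sharpness to the test loss penalty under an FGSM adversarial attack. The sketch below shows the FGSM step, $x_\mathrm{adv} = x + \epsilon\,\mathrm{sign}(\nabla_x \ell)$, on a hypothetical logistic-regression classifier so that the input gradient has a closed form. The paper's experiments use deep networks; every name and value here is an assumption made for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def bce_loss(w, b, x, y) -> float:
    """Mean binary cross-entropy of a logistic-regression model."""
    p = sigmoid(x @ w + b)
    tiny = 1e-12  # avoids log(0)
    return float(np.mean(-(y * np.log(p + tiny) + (1 - y) * np.log(1 - p + tiny))))


def fgsm(w, b, x, y, epsilon: float):
    """One FGSM step: move each input by epsilon along the sign of its loss gradient.

    For logistic regression, d(loss_i)/d(x_i) = (p_i - y_i) * w.
    Clipping to a valid pixel range is omitted in this sketch.
    """
    p = sigmoid(x @ w + b)
    grad_x = (p - y)[:, None] * w[None, :]
    return x + epsilon * np.sign(grad_x)


def fgsm_loss_penalty(w, b, x, y, epsilon: float) -> float:
    """Increase in loss caused by the attack (attacked loss minus clean loss)."""
    return bce_loss(w, b, fgsm(w, b, x, y, epsilon), y) - bce_loss(w, b, x, y)


if __name__ == "__main__":
    w = np.array([1.0, -2.0])
    b = 0.0
    x = np.array([[0.5, 0.5], [-1.0, 1.0]])
    y = np.array([1.0, 0.0])
    print("loss penalty:", fgsm_loss_penalty(w, b, x, y, epsilon=0.1))
```

The loss penalty computed this way is the kind of quantity plotted against $\hat{K}_\mathcal{F}$ in Figure 3, although the attack budget and models used there are not given in this excerpt.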
