Scaling Laws for Uncertainty in Deep Learning

Mattia Rosso, Simone Rossi, Giulio Franzese, Markus Heinonen, Maurizio Filippone

TL;DR

This work investigates whether predictive uncertainties in deep learning obey scaling laws with dataset size $N$ and model size $P$. It combines a broad empirical study (across CNNs, ViTs, and language models, using Monte Carlo dropout, Gaussian approximations, MCMC, deep ensembles, and partially stochastic networks) with theoretical insights linking generalization and uncertainty in identifiable parametric models. The authors demonstrate robust power-law scaling for in- and out-of-distribution uncertainties and give an account based on Bayesian linear regression that shows $O(1/N)$ contraction in identifiable settings, whereas over-parameterization yields more complex, architecture- and method-dependent behavior. They discuss practical implications for extrapolating uncertainties to larger data and models, highlight limitations of entropy-based uncertainty metrics, and point to the need for optimization-aware theory and singular-learning perspectives.

Abstract

Deep learning has recently revealed the existence of scaling laws, demonstrating that model performance follows predictable trends based on dataset and model sizes. Inspired by these findings and fascinating phenomena emerging in the over-parameterized regime, we examine a parallel direction: do similar scaling laws govern predictive uncertainties in deep learning? In identifiable parametric models, such scaling laws can be derived in a straightforward manner by treating model parameters in a Bayesian way. In this case, for example, we obtain $O(1/N)$ contraction rates for epistemic uncertainty with respect to the number of data $N$. However, in over-parameterized models, these guarantees do not hold, leading to largely unexplored behaviors. In this work, we empirically show the existence of scaling laws associated with various measures of predictive uncertainty with respect to dataset and model sizes. Through experiments on vision and language tasks, we observe such scaling laws for in- and out-of-distribution predictive uncertainty estimated through popular approximate Bayesian inference and ensemble methods. Besides the elegance of scaling laws and the practical utility of extrapolating uncertainties to larger data or models, this work provides strong evidence to dispel recurring skepticism against Bayesian approaches: "In many applications of deep learning we have so much data available: what do we need Bayes for?". Our findings show that "so much data" is typically not enough to make epistemic uncertainty negligible.
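
The $O(1/N)$ contraction mentioned above for identifiable models can be illustrated with Bayesian linear regression, where the posterior covariance is available in closed form. The following is a minimal sketch under stated assumptions, not the authors' code: the isotropic Gaussian prior with precision `alpha`, the noise precision `beta`, the standard-normal inputs, and the helper name `epistemic_variance` are all hypothetical choices made for illustration.

```python
import numpy as np


def epistemic_variance(X, x_star, alpha=1.0, beta=25.0):
    """Posterior variance of the latent function value at x_star.

    Assumed model: prior w ~ N(0, alpha^{-1} I) and Gaussian observation
    noise with precision beta. The posterior covariance is
    S_N = (alpha * I + beta * X^T X)^{-1}, which does not depend on the targets.
    """
    d = X.shape[1]
    precision = alpha * np.eye(d) + beta * X.T @ X
    S_N = np.linalg.inv(precision)
    return float(x_star @ S_N @ x_star)


rng = np.random.default_rng(0)
d = 5                                   # number of parameters (fixed, identifiable model)
x_star = np.ones(d) / np.sqrt(d)        # unit-norm test input
sizes = [100, 1_000, 10_000, 100_000]   # dataset sizes N

variances = []
for n in sizes:
    X = rng.standard_normal((n, d))     # synthetic inputs, illustration only
    variances.append(epistemic_variance(X, x_star))

# Slope of log-variance against log-N; a value close to -1 corresponds to O(1/N).
slope = np.polyfit(np.log(sizes), np.log(variances), 1)[0]

for n, v in zip(sizes, variances):
    print(f"N={n:>7d}  epistemic variance={v:.3e}  N*variance={n * v:.4f}")
print(f"fitted log-log slope: {slope:.3f}")
```

In this setting $X^\top X$ grows roughly linearly in $N$, so the epistemic variance at a fixed test input shrinks roughly like $1/(\beta N)$ and the product of $N$ and the variance stays approximately constant. The abstract stresses that this guarantee does not carry over to over-parameterized models.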

Paper Structure

This paper contains 45 sections, 42 equations, 18 figures, 3 tables.

Figures (18)

  • Figure 1: Deep learning uncertainty is predictable, empirically. ResNet-18 Epistemic Uncertainty scaling with the number of training data $N$ on CIFAR-10.
  • Figure 2: ResNet-$\mathbf{d}$ uncertainty scaling on CIFAR-10 and CIFAR-100 datasets: We use MC dropout (with fixed dropout rates $p=0.2$ and $p=0.5$); each point $\times$ corresponds to the average over $10$ independent folds (varying both data subsampling and model initialization). We consider $25\%$, $50\%$, $75\%$ and $100\%$ subsets of the training dataset. Dashed lines represent linear regressions fitted to the mean uncertainty metrics (TU, AU, EU) on a fixed test set (see \ref{sec:uq}), following a power-law decay of the form $N^{\gamma_{TU}}$, $N^{\gamma_{AU}}$, and $N^{\gamma_{EU}}$. Both axes are shown on a logarithmic scale.
  • Figure 3: Impact of + on uncertainty scaling: ResNets on CIFAR-10 dataset using ($p = 0.5$) for . biases solutions towards flatter minima and the combination with preserves functional diversity as data size increases.
  • Figure 4: ResNets uncertainties out-of-distribution: We use with $p = 0.2$ in (a) and $p = 0.5$ in (b). For models trained on incrementally larger training subsets of CIFAR-10, we report the predictive uncertainties when testing on the (whole) CIFAR10-C dataset, averaged over all corruption levels ($1$-$5$) and corruption types considered. , we expect to observe larger uncertainties - , for instance, should decay gradually as the data space becomes increasingly populated with additional samples within the same domain (in-fill).
  • Figure 5: WideResNet-$\mathbf{w}$-$\mathbf{d}$ uncertainty scaling on CIFAR-10 dataset: We consider WideResNet-40-4 and WideResNet-28-10 and use deep ensembles with $M=5$ and $M=10$ ensemble members.
  • ...and 13 more figures
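
The dashed-line fits described in the Figure 2 caption amount to a linear regression in log-log space, whose slope is the power-law exponent $\gamma$. The sketch below shows one way such a fit and a subsequent extrapolation could be done with NumPy. It is an illustration only: the uncertainty values are synthetic (generated from a known power law with small noise, not taken from the paper), and `fit_power_law`, `extrapolate`, `true_gamma`, and `true_c` are hypothetical names and constants.

```python
import numpy as np


def fit_power_law(n_values, u_values):
    """Fit u = c * N**gamma by least squares on (log N, log u).

    Returns (gamma, c). Assumes all inputs are strictly positive.
    """
    log_n = np.log(np.asarray(n_values, dtype=float))
    log_u = np.log(np.asarray(u_values, dtype=float))
    gamma, log_c = np.polyfit(log_n, log_u, 1)  # slope first, then intercept
    return gamma, np.exp(log_c)


def extrapolate(n, gamma, c):
    """Predict the uncertainty at dataset size n from a fitted power law."""
    return c * np.asarray(n, dtype=float) ** gamma


# 25%, 50%, 75% and 100% subsets of the 50,000 CIFAR-10 training images.
fractions = np.array([0.25, 0.50, 0.75, 1.00])
n_train = fractions * 50_000

# Synthetic "mean uncertainty" values: assumed exponent and prefactor,
# plus small multiplicative noise. These are NOT measurements.
true_gamma, true_c = -0.5, 2.0
rng = np.random.default_rng(0)
u_mean = true_c * n_train ** true_gamma * np.exp(rng.normal(0.0, 0.01, size=4))

gamma_hat, c_hat = fit_power_law(n_train, u_mean)
u_at_10x = extrapolate(10 * n_train[-1], gamma_hat, c_hat)

print(f"fitted gamma: {gamma_hat:.3f}, fitted prefactor: {c_hat:.3f}")
print(f"extrapolated uncertainty at N={10 * n_train[-1]:.0f}: {u_at_10x:.5f}")
```

With only four subset sizes spanning less than one order of magnitude, the fitted exponent is sensitive to noise, so extrapolations far beyond the observed range should be read with caution.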