Measurement noise scaling laws for cellular representation learning

Gokul Gowri, Peng Yin, Allon M. Klein

TL;DR

The paper identifies measurement noise as a distinct scaling axis for representation learning, showing that performance degrades according to a logarithmic law with noise in both cellular and image data. It introduces an information-theoretic probing metric $I(f(Z); Y)$ and derives a simple Gaussian-noise model that yields an analytic noise-scaling form, producing data-collapse across models and datasets. Empirically, the authors demonstrate a saturating power-law with dataset size and a noise-driven curve that collapses to a universal form, with practical implications for data quality, model choice (favoring VAEs over transformers in noisier settings), and data collection strategies. The work suggests that measurement sensitivity can significantly enhance scalability and generalizes the scaling law to non-biological domains, offering a principled guide for data curation and experimental design. It also provides concrete procedures for estimating representation quality via mutual information and for predicting how much noise can be tolerated for a desired level of task informativeness.

Abstract

Deep learning scaling laws predict how performance improves with increased model and dataset size. Here we identify measurement noise in data as another performance scaling axis, governed by a distinct logarithmic law. We focus on representation learning models of biological single cell genomic data, where a dominant source of measurement noise is due to molecular undersampling. We introduce an information-theoretic metric for cellular representation model quality, and find that it scales with sampling depth. A single quantitative relationship holds across several model types and across several datasets. We show that the analytical form of this relationship can be derived from a simple Gaussian noise model, which in turn provides an intuitive interpretation for the scaling law. Finally, we show that the same relationship emerges in image classification models with respect to two types of imaging noise, suggesting that measurement noise scaling may be a general phenomenon. Scaling with noise can serve as a guide in generating and curating data for deep learning models, particularly in fields where measurement quality can vary dramatically between datasets.
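
For intuition about why a Gaussian noise model can give a logarithmic dependence on noise, the textbook scalar case is sketched below. This is an assumption-laden illustration, not the paper's three-variable model or its theorem: a Gaussian signal with variance `signal_var` is observed through additive independent Gaussian noise with variance `noise_var`, and the mutual information between signal and observation is $\tfrac{1}{2}\log_2(1 + \text{signal\_var}/\text{noise\_var})$ bits. The function name and parameters are hypothetical.

```python
import math


def gaussian_channel_mi_bits(signal_var: float, noise_var: float) -> float:
    """Mutual information (bits) between X ~ N(0, signal_var) and X + N,
    where N ~ N(0, noise_var) is independent of X.

    Standard scalar Gaussian channel result; used here only as an
    illustration, not as the paper's noise model.
    """
    if signal_var < 0 or noise_var <= 0:
        raise ValueError("signal_var must be >= 0 and noise_var must be > 0")
    return 0.5 * math.log2(1.0 + signal_var / noise_var)


if __name__ == "__main__":
    # Each 10x reduction in noise variance adds roughly the same amount of
    # information once the signal-to-noise ratio is large.
    for noise_var in (10.0, 1.0, 0.1, 0.01, 0.001):
        mi = gaussian_channel_mi_bits(1.0, noise_var)
        print(f"noise_var={noise_var:<6} MI={mi:.3f} bits")
```

In this toy setting, information grows linearly in the logarithm of the signal-to-noise ratio when that ratio is large, and it approaches zero as noise dominates.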

Paper Structure

This paper contains 46 sections, 2 theorems, 30 equations, 15 figures, 5 tables.

Key Result

Theorem 4.1

For the three-variable Gaussian noise model specified above, [equation omitted]. In the special case where $n = 1$: [equation omitted], where $\Sigma_Y = \sigma^2_Y$ and $\Sigma_U = \sigma^2_U$.

Figures (15)

  • Figure 1: Measurement sensitivity and dataset size scaling experiments. (a) Information gained above random projection (denoted $\Delta$MI) for models trained on various datasets, shown as a function of training dataset size (cell number, $\log_{10}$ scale). (b) Representation information as a function of UMI per cell ($\log_{10}$ scale) for all tested models and datasets, including random projection.
  • Figure 2: Estimated $I_\infty$ values for models across UMI downsampling conditions. Each scatterplot point corresponds to an $I_\infty$ estimate from Eq. \ref{eqn:sat-powerlaw}. Dotted line denotes Eq. \ref{eqn:scaling_general} fit to the estimated $I_\infty$. Shaded region denotes $2\sigma$ confidence interval. Downsampling fraction refers to the fraction of total counts preserved from the original dataset.
  • Figure 3: Experimental data collapses onto scaling curves. (a) Each scatterplot point corresponds to an empirically measured representation information for a PCA, VAE, or MobileNetv2 model trained on data of varying quality, with values rescaled to lie on nondimensionalized axes. Dotted line corresponds to Eq. \ref{eqn:collapse}. Models for which $\mathcal{I}_{\max}$ estimates are poorly constrained are omitted. (b) Example fits of Eq. \ref{eqn:scaling_general} for cellular representation learning models trained on full dataset size. Shaded region denotes $2\sigma$ confidence intervals. Individual fits for each dataset size are given in Appendix Fig. \ref{fig:UMI_scaling_all}.
  • Figure 4: Models extract information more slowly from noisier data. Each heatmap shows the mean information gained per $\log_{10}$ cells in one dataset for each model type. Downsampling fraction corresponds to the fraction of the measured mean UMIs per cell which are included in the artificially undersampled dataset.
  • Figure 5: Measurement noise scaling laws for image classifiers. Mutual information between true and predicted labels for MobileNetv2 trained on datasets subject to varying levels of noise. (left) Effect of downsampling image resolution (i.e. pixelation). (right) Effect of i.i.d. pixel-wise Gaussian noise.
  • ...and 10 more figures
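
The quantity plotted in Figure 5, mutual information between true and predicted labels, can be illustrated with a simple plug-in estimate from label counts. This is a minimal sketch under stated assumptions: the function name is hypothetical, the plug-in estimator is a generic choice that is biased upward for small samples, and the paper's own estimation procedure may differ.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def label_mutual_information_bits(
    true_labels: Sequence[Hashable], predicted_labels: Sequence[Hashable]
) -> float:
    """Plug-in estimate (bits) of I(true; predicted) from paired labels.

    Uses empirical joint and marginal frequencies; an illustrative
    estimator, not necessarily the one used in the paper.
    """
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label sequences must have the same length")
    n = len(true_labels)
    if n == 0:
        raise ValueError("label sequences must be non-empty")

    joint = Counter(zip(true_labels, predicted_labels))
    true_counts = Counter(true_labels)
    pred_counts = Counter(predicted_labels)

    mi = 0.0
    for (t, p), c in joint.items():
        # p(t, p) * log2( p(t, p) / (p(t) * p(p)) ), written with raw counts.
        mi += (c / n) * math.log2(c * n / (true_counts[t] * pred_counts[p]))
    return mi


if __name__ == "__main__":
    y_true = [0, 0, 1, 1]
    print(label_mutual_information_bits(y_true, [0, 0, 1, 1]))  # perfect predictions
    print(label_mutual_information_bits(y_true, [0, 1, 0, 1]))  # uninformative predictions
```

A classifier whose predictions are independent of the true labels scores zero, while a perfect classifier on balanced classes scores the label entropy, so the metric degrades smoothly as noise removes label-relevant information.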

Theorems & Definitions (3)

  • Theorem 4.1
  • Theorem 1.1: Theorem 3.1
  • proof