The Underlying Scaling Laws and Universal Statistical Structure of Complex Datasets

Noam Levi, Yaron Oz

TL;DR

The paper investigates universal statistical structure in complex datasets by treating data as a physical system and applying Random Matrix Theory to the Gram matrix $\Sigma_M = \tfrac{1}{M} X X^T$. It shows that real-world data and correlated Gaussian datasets share GOE-like bulk statistics, which can be captured by a simple Toeplitz-correlated Wishart model (CGD) and approximated even with moderate sample sizes $M_{\mathrm{crit}} \sim d$. A single bulk scaling exponent $\alpha$ governs the power-law eigenvalue decay, with $\lambda_i \propto i^{-1-\alpha}$, and the Shannon entropy of the spectrum correlates with local RMT structure, being lower for strongly correlated data. The results imply that Gram matrices of natural images are well described by Wishart ensembles with simple covariance, enabling rigorous analyses of neural network dynamics and generalization. The framework provides a bridge between real data complexity and tractable RMT models, with potential extension to multiple modalities and learning dynamics beyond random feature models.
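The scaling relation $\lambda_i \propto i^{-1-\alpha}$ suggests a simple way to read $\alpha$ off a spectrum: fit a straight line to $\log \lambda_i$ versus $\log i$ over the bulk and convert the slope. The following is a minimal sketch of that idea, not the authors' fitting procedure. The function name `fit_bulk_exponent`, the rank window `i_min`/`i_max`, and the synthetic spectrum are all assumptions made for illustration.

```python
import math

def fit_bulk_exponent(eigenvalues, i_min=1, i_max=None):
    """Estimate alpha from lambda_i ~ i^(-1 - alpha) by a log-log least-squares fit.

    Ranks are 1-indexed and the window [i_min, i_max] is inclusive.
    The window is a hypothetical way to restrict the fit to the bulk.
    """
    lam = sorted(eigenvalues, reverse=True)  # rank 1 = largest eigenvalue
    if i_max is None:
        i_max = len(lam)
    xs = [math.log(i) for i in range(i_min, i_max + 1)]
    ys = [math.log(lam[i - 1]) for i in range(i_min, i_max + 1)]
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx            # slope of log(lambda) vs log(rank)
    return -slope - 1.0          # slope = -(1 + alpha)

# Synthetic, noise-free spectrum with a known exponent (illustrative only).
d = 500
true_alpha = 0.25
spectrum = [3.0 * i ** (-1.0 - true_alpha) for i in range(1, d + 1)]

alpha_hat = fit_bulk_exponent(spectrum, i_min=10, i_max=400)
print(round(alpha_hat, 6))
```

On real data the result depends on which ranks are treated as the bulk, so the window should be chosen by inspecting the scree plot rather than taken from this sketch.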

Abstract

We study universal traits which emerge both in real-world complex datasets, as well as in artificially generated ones. Our approach is to analogize data to a physical system and employ tools from statistical physics and Random Matrix Theory (RMT) to reveal their underlying structure. We focus on the feature-feature covariance matrix, analyzing both its local and global eigenvalue statistics. Our main observations are: (i) The power-law scalings that the bulk of its eigenvalues exhibit are vastly different for uncorrelated normally distributed data compared to real-world data, (ii) this scaling behavior can be completely modeled by generating Gaussian data with long range correlations, (iii) both generated and real-world datasets lie in the same universality class from the RMT perspective, as chaotic rather than integrable systems, (iv) the expected RMT statistical behavior already manifests for empirical covariance matrices at dataset sizes significantly smaller than those conventionally used for real-world training, and can be related to the number of samples required to approximate the population power-law scaling behavior, (v) the Shannon entropy is correlated with local RMT structure and eigenvalues scaling, is substantially smaller in strongly correlated datasets compared to uncorrelated ones, and requires fewer samples to reach the distribution entropy. These findings show that with sufficient sample size, the Gram matrix of natural image datasets can be well approximated by a Wishart random matrix with a simple covariance structure, opening the door to rigorous studies of neural network dynamics and generalization which rely on the data Gram matrix.
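Observation (v) can be made concrete with a small calculation. The sketch below assumes the entropy is the Shannon entropy of the normalized spectrum, $H = -\sum_i p_i \log p_i$ with $p_i = \lambda_i / \sum_j \lambda_j$; this definition is an assumption here, since the abstract does not spell it out. It compares an idealized flat spectrum (a stand-in for uncorrelated data with identity population covariance) against a power-law spectrum. The name `spectral_entropy` and both example spectra are hypothetical.

```python
import math

def spectral_entropy(eigenvalues):
    """Shannon entropy of the eigenvalues normalized to sum to one (assumed definition)."""
    total = sum(eigenvalues)
    probs = [lam / total for lam in eigenvalues if lam > 0]
    return -sum(p * math.log(p) for p in probs)

d = 1000
flat = [1.0] * d                                   # idealized uncorrelated case
power_law = [i ** -1.25 for i in range(1, d + 1)]  # lambda_i ~ i^(-1 - alpha), alpha = 1/4

h_flat = spectral_entropy(flat)
h_power = spectral_entropy(power_law)
print(h_flat, h_power)
```

The flat spectrum attains the maximum possible value $\log d$, while the power-law spectrum concentrates weight on a few leading eigenvalues and therefore has lower entropy, in line with the qualitative statement in the abstract.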

Paper Structure (26 sections, 37 equations, 9 figures)

Figures (9)

  • Figure 1: Left: Scree plot of $\Sigma_{ij,M}$ for several different vision datasets, as well as for UGD and a CGD with fixed $\alpha$. Here, the number of samples is the entire dataset for each real-world dataset, and $M=50\mathrm{k}$ for the Gaussian data, where we set $c=1$. We see a clear scaling law for the eigenvalue bulk, $\lambda_i \propto i^{-1-\alpha}$, where all real-world datasets display $\alpha \leq 1/2$. Right: The value of the power-law scaling parameter $\alpha$ can be tuned from $\alpha = 1/4$ to $\alpha = -1$ by corrupting the FMNIST dataset with a varying amount of normally distributed noise.
  • Figure 2: Top row: Scree plot of $\Sigma_{ij,M}$ for several different configurations and datasets. We show the eigenvalues of the population covariance matrix $\Sigma^\mathrm{Toe}$, the eigenvalues of the empirical covariance of the full real-world dataset with $M=50\mathrm{k}$, and finally the eigenvalues of the empirical covariance using the same $\Sigma^\mathrm{Toe}$, with $M=50\mathrm{k}$. The datasets used here are (left to right): FMNIST, CIFAR10, ImageNet. Bottom row: Spectral density for the bulk of eigenvalues for the same datasets, as well as a comparison against UGD of the same dimensions. The $\bar{\lambda}$ indicates normalization over the maximal eigenvalue among the bulk. We also provide the KL divergence between the CGDs and the real-world data distributions.
  • Figure 3: The $r$ probability density (left), the unfolded level spacing distribution (center) and the spectral form factor (right) of $\Sigma_M$ for FMNIST, CIFAR10, their CGDs, and UGD, obtained with $M=50000$. Black curves indicate the RMT predictions for the GOE distribution from Eq. \ref{eq:RMT_prediction}. These results indicate that the bulk of real-world data eigenvalues belongs to the GOE universality class, and that the system has enough statistics to converge to the RMT predictions.
  • Figure 4: Left: The $r$ distance metric $\delta(M)$ for the bulk of eigenvalues. Center: The $\alpha$ distance metric $\Delta(M)$ for the bulk of eigenvalues. Right: The full matrix comparison metric $\epsilon(M)$. We show the results for CIFAR10, FMNIST, UGD, and the FMNIST CGD as a function of the number of samples. The results show that the bulk distances decrease as $1/M$, where $M$ is the number of samples, asymptoting to a constant value at similar values of $M_\mathrm{crit}\sim d$ (black dashed), where $d$ is the number of features.
  • Figure 5: Convergence of the various metrics in Eqs. \ref{eq:r_scaling}, \ref{eq:alpha_scaling}, and \ref{eq:sigma_scaling} in relation to entropy for the bulk of eigenvalues. Left: The Shannon entropy $H_M$ as a function of the dataset size $M$. Center: Convergence of the normalized $\alpha$ metric $\Delta_M/\Delta$ to its asymptotic value as a function of the normalized entropy $H_M/H$. Right: Convergence of the normalized $r$ statistics metric $\delta_M/\delta$ to its asymptotic value as a function of the normalized entropy $H_M/H$. We show the results for CIFAR10, FMNIST, MNIST, UGD, and the FMNIST CGD.
  • ...and 4 more figures
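
The $r$ statistics shown in Figures 3 to 5 can be illustrated with a short sketch. It assumes the standard consecutive-spacing ratio $r_i = \min(s_i, s_{i+1}) / \max(s_i, s_{i+1})$, with $s_i = \lambda_{i+1} - \lambda_i$ the gap between adjacent sorted eigenvalues; the captions do not restate the definition, so this is an assumption. For reference, the mean ratio is approximately $0.53$ for GOE spectra and $2\ln 2 - 1 \approx 0.39$ for uncorrelated (Poisson) levels. The helper names and the two toy spectra are hypothetical, and no random matrix is diagonalized here.

```python
import random

def spacing_ratios(levels):
    """Ratios r_i = min(s_i, s_{i+1}) / max(s_i, s_{i+1}) of consecutive level spacings."""
    lam = sorted(levels)
    spacings = [b - a for a, b in zip(lam, lam[1:])]
    ratios = []
    for s1, s2 in zip(spacings, spacings[1:]):
        larger = max(s1, s2)
        if larger > 0:  # skip exactly degenerate triples
            ratios.append(min(s1, s2) / larger)
    return ratios

def mean_ratio(levels):
    r = spacing_ratios(levels)
    return sum(r) / len(r)

# Uncorrelated levels: i.i.d. uniform points give Poisson spacing statistics.
rng = random.Random(0)
poisson_levels = [rng.random() for _ in range(20000)]
r_poisson = mean_ratio(poisson_levels)

# Perfectly rigid spectrum: equal spacings give r = 1 for every triple.
picket_fence = [float(i) for i in range(100)]
r_rigid = mean_ratio(picket_fence)

print(r_poisson, r_rigid)
```

Because the ratio uses only neighboring gaps, it does not require unfolding the spectrum, which makes it convenient for comparing datasets with different eigenvalue densities.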