Mitigating the Curse of Detail: Scaling Arguments for Feature Learning and Sample Complexity

Noa Rubin, Orit Davidovich, Zohar Ringel

TL;DR

The paper tackles the challenge of predicting when and how feature learning emerges in deep networks by focusing on scalable, first-principles arguments rather than detailed, architecture-specific analyses. It builds a Bayesian-LDT framework to bound the minimal training set size P_* via an alignment-based energy E(alpha) and then develops a variational approach to estimate E(alpha) by comparing a small set of feature-learning patterns (GP, GFL, specialization) and their propagation through layers. The authors apply this machinery to three-layer networks and softmax-attention heads, recovering known scaling exponents and making novel predictions about pattern transitions and the growth of specializing neurons. Overall, the work provides a tractable, principled route to anticipate data and width scales across architectures, linking kernel dynamics to emergent learning patterns and informing hyperparameter transfer in practice.

Abstract

Two pressing topics in the theory of deep learning are the interpretation of feature learning mechanisms and the determination of the implicit bias of networks in the rich regime. Current theories of rich feature learning often appear in the form of high-dimensional non-linear equations, which require computationally intensive numerical solutions. Given the many details that go into defining a deep learning problem, this complexity is a significant and often unavoidable challenge. Here, we propose a powerful heuristic route for predicting the data and width scales at which various patterns of feature learning emerge. This form of scale analysis is considerably simpler than exact theories and reproduces the scaling exponents of various known results. In addition, we make novel predictions on complex toy architectures, such as three-layer non-linear networks and attention heads, thus extending the scope of first-principles theories of deep learning.

Paper Structure

This paper contains 47 sections, 125 equations, 9 figures, 1 table.

Figures (9)

  • Figure 1: Logical flow of the sample complexity derivation. Bounds: (i)-(iii) deriving lower bounds on sample complexity using LDT (Sec. \ref{sec:bounds_alignment}). Approximations: (iv) approximating the intractable lower bound (Sec. \ref{sec:variational_analysis}). Heuristics: (v)-(vi) providing heuristic methods for manually computing the approximated bound (Sec. \ref{sec:heuristics}). Each section is composed of intermediate steps as detailed in the diagram.
  • Figure 2: Numerical and experimental results for a two-layer erf network trained on the normalized third Hermite polynomial ($m=3$). In panel (a) we compare the experimental results and exact theoretical predictions (computed using LDT, see App. \ref{app:upper_bound_chern_FCN}) for the distribution of the alignment of the hidden-layer pre-activation with the linear feature. Here we follow the same notation as in (\ref{eq:H_def_main_text}), so that $H_{\pi}$ is the negative log posterior of the pre-activations up to an additive constant that enforces zero minimum. We also find that the pre-activation distribution corresponds to $q(h)$ for $q\sim\text{M-Sp}$, as predicted by our heuristic approach. Panel (b) compares theoretical and experimental predictions for $P_*$, defined as the sample size at which the alignment reaches $\alpha>0.1$ (inset shows alignment as a function of sample size). Both theoretical and experimental results agree on $P_*\propto d$. In (c), we increase $N$ while keeping $P$ and $d$ fixed, and plot the number of specialized neurons in the hidden layer. In agreement with our heuristic predictions, the number of specialized neurons increases linearly with $\sqrt{N/d}$.
  • Figure 3: Sample complexity: Heuristic predictions accurately capture sample complexity in both three-layer erf FCNs and softmax attention heads, as well as feature learning scaling. Panels (a) and (b) both track how the network alignment changes as a function of the ratio between the sample size, $P$, and the predicted sample complexity: $P/d$ for the FCN in panel (a), and $P/\sqrt{Ld^3}$ for the attention head in panel (b). In both cases, we observe that the alignment collapses onto a single curve, confirming the predicted sample complexity at which good alignment is achieved. See Fig. \ref{appfig:mse} for a comparison to MSE. See App. \ref{App:SoftMaxLayer} and App. \ref{app:3layer} for experimental details. Feature learning patterns: Panel (c) tracks the number of linearly specialized neurons in both the first (blue) and second (purple) layers as the first-layer width $N_1$ varies (with fixed $P$, $d$ and $N_2$). The number of first-layer specializing neurons initially follows the predicted $(N_1/d)^{1/3}$ scaling before the predicted transition occurs, where second-layer neurons begin to specialize on the linear feature rather than the cubic one, and the first-layer neurons approach the GP distribution.
  • Figure 4: Schematic illustration of different candidate feature learning patterns per neuron.
  • Figure 5: Demonstration of GFL feature propagation. We increased the variance of the $d/2$ highest kernel mode $\Phi_*(x)$ of the first hidden layer and then measured $\langle \Phi_* | K_{l=3} | \Phi_* \rangle^{-1}$ and $\langle \Phi_* | [K_{l=3}]^{-1} | \Phi_* \rangle$, where $K_{l=3}$ is the kernel of the subsequent layer. We used ReLU activations, $d=120$, $N_1=2000$, $N_2=1000$, and random Gaussian data. This demonstrates the expected $D^{-2}$ decay and the matching between inverse expectation values and the RKHS.
  • ...and 4 more figures
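
The two quantities measured in Figure 5, $\langle \Phi_* | K | \Phi_* \rangle^{-1}$ and $\langle \Phi_* | K^{-1} | \Phi_* \rangle$, can be compared numerically on any positive-definite matrix. The sketch below is only an illustration under stated assumptions, not the paper's experiment: a random positive-definite matrix stands in for the layer kernel, and the function names `inverse_expectation` and `expectation_of_inverse` are hypothetical. It shows the general linear-algebra fact that the two quantities coincide when $\Phi_*$ is an eigenvector of $K$, while for a generic unit vector the expectation of the inverse is at least as large as the inverse of the expectation.

```python
import numpy as np


def inverse_expectation(K, phi):
    """Return 1 / <phi|K|phi> after normalizing phi to unit length."""
    phi = phi / np.linalg.norm(phi)
    return 1.0 / (phi @ K @ phi)


def expectation_of_inverse(K, phi):
    """Return <phi|K^{-1}|phi> after normalizing phi to unit length."""
    phi = phi / np.linalg.norm(phi)
    # Solve K x = phi instead of forming the explicit inverse.
    return phi @ np.linalg.solve(K, phi)


rng = np.random.default_rng(0)
n = 50

# Assumption: a toy positive-definite matrix in place of a network kernel.
A = rng.standard_normal((n, n))
K = A @ A.T / n + 1e-3 * np.eye(n)

# Case 1: phi is an exact eigenvector of K (the top mode).
eigvals, eigvecs = np.linalg.eigh(K)
phi_eig = eigvecs[:, -1]
a_eig = inverse_expectation(K, phi_eig)
b_eig = expectation_of_inverse(K, phi_eig)

# Case 2: phi is a generic direction that mixes many eigenmodes.
phi_rand = rng.standard_normal(n)
a_rand = inverse_expectation(K, phi_rand)
b_rand = expectation_of_inverse(K, phi_rand)

print("eigenvector:", a_eig, b_eig)
print("random vector:", a_rand, b_rand)
```

In this toy setting, the gap between the two numbers for a given direction indicates how far that direction is from being a single eigenmode of the matrix; the matching reported in the caption would correspond to the first case.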