Mitigating the Curse of Detail: Scaling Arguments for Feature Learning and Sample Complexity
Noa Rubin, Orit Davidovich, Zohar Ringel
TL;DR
The paper tackles the challenge of predicting when and how feature learning emerges in deep networks by focusing on scalable, first-principles arguments rather than detailed, architecture-specific analyses. It builds a Bayesian-LDT framework to bound the minimal training set size P_* via an alignment-based energy E(alpha) and then develops a variational approach to estimate E(alpha) by comparing a small set of feature-learning patterns (GP, GFL, specialization) and their propagation through layers. The authors apply this machinery to three-layer networks and softmax-attention heads, recovering known scaling exponents and making novel predictions about pattern transitions and the growth of specializing neurons. Overall, the work provides a tractable, principled route to anticipate data and width scales across architectures, linking kernel dynamics to emergent learning patterns and informing hyperparameter transfer in practice.
Abstract
Two pressing topics in the theory of deep learning are the interpretation of feature learning mechanisms and the determination of the implicit bias of networks in the rich regime. Current theories of rich feature learning often appear in the form of high-dimensional non-linear equations, which require computationally intensive numerical solutions. Given the many details that go into defining a deep learning problem, this complexity is a significant and often unavoidable challenge. Here, we propose a powerful heuristic route for predicting the data and width scales at which various patterns of feature learning emerge. This form of scale analysis is considerably simpler than exact theories and reproduces the scaling exponents of various known results. In addition, we make novel predictions on complex toy architectures, such as three-layer non-linear networks and attention heads, thus extending the scope of first-principles theories of deep learning.
