The Local Learning Coefficient: A Singularity-Aware Complexity Measure

Edmund Lau; Zach Furman; George Wang; Daniel Murfet; Susan Wei

TL;DR

The paper argues that traditional, parameter-count-based complexity measures are inadequate for deep neural networks because of singularities in the loss landscape. It introduces the Local Learning Coefficient (LLC), a singularity-aware complexity measure grounded in Singular Learning Theory, and derives a local volume scaling $V(\epsilon) \propto \epsilon^{\lambda(w^*)} (-\log \epsilon)^{m(w^*)-1}$ that governs the geometry near a minimum $w^*$, as well as a local free energy expansion $F_n(B_\gamma(w^*)) = n L_n(w^*) + \lambda(w^*) \log n + o_p(\log \log n)$. The authors develop a scalable LLC estimator from a practical surrogate with a localizing prior and an SGLD-based sampling scheme, and validate it empirically on deep linear networks with up to $10^8$ parameters, ResNet models on CIFAR-10, and transformer language models, showing that training heuristics meaningfully modulate the LLC. They demonstrate that the LLC decreases with stronger implicit regularization and can reveal the effective simplicity of trained networks, offering a principled framework for reconciling deep networks' apparent complexity with parsimony. The work provides a data-distribution-sensitive lens for model selection and for understanding training dynamics, with potential to illuminate phase transitions and emergent abilities in large-scale models.

Abstract

The Local Learning Coefficient (LLC) is introduced as a novel complexity measure for deep neural networks (DNNs). Recognizing the limitations of traditional complexity measures, the LLC leverages Singular Learning Theory (SLT), which has long recognized the significance of singularities in the loss landscape geometry. This paper provides an extensive exploration of the LLC's theoretical underpinnings, offering both a clear definition and intuitive insights into its application. Moreover, we propose a new scalable estimator for the LLC, which is then effectively applied across diverse architectures including deep linear networks up to 100M parameters, ResNet image models, and transformer language models. Empirical evidence suggests that the LLC provides valuable insights into how training heuristics might influence the effective complexity of DNNs. Ultimately, the LLC emerges as a crucial tool for reconciling the apparent contradiction between deep learning's complexity and the principle of parsimony.
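To make the idea of a sampling-based LLC estimator concrete, here is a minimal sketch on a toy loss. It assumes the estimator has the form $\hat\lambda(w^*) = n\beta\,(\mathbb{E}[L_n(w)] - L_n(w^*))$ with inverse temperature $\beta = 1/\log n$, where the expectation is over a posterior tempered by $\beta$ and localized around $w^*$ by a Gaussian prior of strength $\gamma$; this summary does not spell the formula out, so treat it as an assumption. The sketch also swaps in full-gradient Langevin dynamics for minibatch SGLD, and the function name `estimate_llc`, the step size, and the chain counts are all illustrative choices rather than the authors' settings.

```python
import numpy as np

def estimate_llc(loss, grad, w_star, n, gamma=1.0, step=2e-5,
                 n_chains=200, n_steps=2000, burn_in=500, seed=0):
    """Toy LLC estimate via full-gradient Langevin dynamics (not minibatch SGLD).

    loss(w) maps an array of shape (chains, d) to per-chain losses of shape (chains,);
    grad(w) returns the matching gradients of shape (chains, d).
    """
    rng = np.random.default_rng(seed)
    beta = 1.0 / np.log(n)  # assumed inverse temperature, beta = 1 / log n
    # Start every chain at the point whose LLC we want to estimate.
    w = np.tile(w_star, (n_chains, 1)).astype(float)
    sampled_losses = []
    for t in range(n_steps):
        # Drift of the tempered, localized posterior:
        # -n * beta * grad L(w) - gamma * (w - w_star)
        drift = -n * beta * grad(w) - gamma * (w - w_star)
        noise = rng.standard_normal(w.shape)
        w = w + 0.5 * step * drift + np.sqrt(step) * noise
        if t >= burn_in:
            sampled_losses.append(loss(w).mean())
    expected_loss = np.mean(sampled_losses)
    return n * beta * (expected_loss - loss(w_star[None, :]).item())

# Regular toy loss L(w) = w1^2 + w2^2, whose learning coefficient is 1.
quadratic_loss = lambda w: np.sum(w ** 2, axis=1)
quadratic_grad = lambda w: 2.0 * w

llc_hat = estimate_llc(quadratic_loss, quadratic_grad,
                       w_star=np.zeros(2), n=10_000)
print(f"estimated LLC for the quadratic loss: {llc_hat:.3f}")
```

For this quadratic loss the localized posterior is Gaussian, so the estimate should land close to 1 up to discretization and Monte Carlo error. Singular losses are harder to sample and may need smaller steps or longer chains.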

Paper Structure (46 sections, 1 theorem, 51 equations, 14 figures, 1 table, 1 algorithm)

Introduction
Setup
The local learning coefficient
Complexity via counting low loss parameters
LLC estimation
Idealized LLC estimator
Surrogate for enforcing $B_\gamma(w^*)$
The LLC estimator
The SGLD-based LLC estimator
Experiments
LLC for Deep Linear Networks (DLNs)
LLC for ResNet
Related work
Outlook
Background on Singular Learning Theory
...and 31 more sections

Key Result

Theorem 1

Let $r := \mathrm{rank}\left(W_M^* \dots W_2^* W_1^*\right)$ be the rank of the linear transformation implemented by the true DLN $f(x, w)$, and set $\Delta_j := H_j - r$ for $j = 0, \dots, M$. There exists a subset of indices $\Sigma \subset \{0, 1, \dots, M\}$, $\Sigma = \{\sigma_1, \dots$ […] Assuming that the DLN truth-model pair $\left(q(x, y), p(x, y | w)\right)$ satisfies the relatively […]
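The two quantities the theorem starts from, $r$ and $\Delta_j$, are straightforward to compute for a concrete network. The sketch below assumes that $H_j$ denotes the layer widths, with $H_0$ the input dimension and $H_M$ the output dimension, and that $W_j$ has shape $(H_j, H_{j-1})$; the excerpt does not state either convention, and the helper name `rank_and_deltas` is hypothetical. It stops at $r$ and $\Delta_j$ because the rest of the theorem statement is truncated here.

```python
import numpy as np

def rank_and_deltas(weights):
    """Return r = rank(W_M ... W_1) and Delta_j = H_j - r for j = 0..M.

    `weights` is [W_1, ..., W_M], with W_j of shape (H_j, H_{j-1})
    (an assumed convention, see the lead-in).
    """
    product = weights[0]
    for W in weights[1:]:
        product = W @ product  # left-multiply so the result is W_M ... W_1
    r = int(np.linalg.matrix_rank(product))
    # H_0 is the input width; H_1..H_M are the output widths of each layer.
    widths = [weights[0].shape[1]] + [W.shape[0] for W in weights]
    deltas = [h - r for h in widths]
    return r, deltas

# Two-layer example with widths H = (4, 3, 2) and an end-to-end map of rank 1.
W1 = np.array([[1.0, 0.0, 0.0, 0.0],
               [0.0, 1.0, 0.0, 0.0],
               [0.0, 0.0, 0.0, 0.0]])
W2 = np.array([[1.0, 0.0, 0.0],
               [0.0, 0.0, 1.0]])
r, deltas = rank_and_deltas([W1, W2])
print(r, deltas)
```

Note that the rank of the product can be strictly smaller than the rank of each factor, as in this example, which is why $r$ is defined on the end-to-end map rather than layer by layer.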

Figures (14)

Figure 1: Impact of SGD learning rate (top), batch size (middle) and momentum (bottom) when training ResNet18 on CIFAR10. We plot the LLC estimate (left), test accuracy (middle) and train loss (right) across training time. As the strength of the implicit regularization increases --- through higher learning rate, lower batch size and higher momentum --- LLC decreases (the network gets "simpler") and test accuracy increases. Even though most training losses collapse to zero, the LLC can discern the implicit regularization pressure applied by various training heuristics.
Figure 2: The LLC ${\lambda({w^*})}$ measures volume scaling around a local minimum of the loss. Top: the area (or "2D-volume") under the loss $L(w_1, w_2) = {w_1}^2 + {w_2}^2$ scales as $V(\epsilon) \propto \epsilon$ around ${w^*} = (0,0)$, so ${\lambda({w^*})} = 1$. Bottom: the area under the loss $L(w_1, w_2) = {w_1}^2 {w_2}^4$ scales as $V(\epsilon) \propto \epsilon^{1/4}$ around ${w^*} = (0,0)$, so ${\lambda({w^*})} = \frac{1}{4}$. Lower ${\lambda({w^*})}$ means more volume: as we approach the minimum ($\epsilon \rightarrow 0$), there is much more volume available for the bottom loss function compared to the top for any given small tolerance $\epsilon$. Reproduced with permission from hoogland2023.
Figure 3: Estimated LLC against true learning coefficient; model dimension shown in color. On the left, we evaluate the LLC estimator at a global minimum, ${w^*}$, of the population loss. On the right, we evaluate the LLC estimator at a minimum, $\hat{w}_n^*$, found by SGD. Fortunately, we do not see an adverse effect of using the training data twice, a minor concern we had raised at the end of Section \ref{sec:llc_estimator}. The estimated LLC accurately measures the learning coefficient $\lambda$ up to 100 million parameters in deep linear networks, as compared to known theoretical values (dashed line). See Figure \ref{fig:lambdahat-vs-lambda linear scale} for linear-scale plots.
Figure H.1: Sample loss trace (blue, left axis) and MALA acceptance probability (red, right axis) over DLN training trajectories at different model sizes.
Figure I.1: The top graph shows $\lambda$ decreasing as the DLN becomes deeper, even though model parameter count increases with number of layers. The bottom graph shows the true multiplicities, $m$. Since regular models can only have $m = 1$, the graph shows that most of these randomly generated DLNs are singular.
...and 9 more figures
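The volume scaling described in the Figure 2 caption can be checked numerically. The sketch below measures the area of the sublevel set $\{w \in [-1,1]^2 : L(w) < \epsilon\}$ for both example losses and fits the exponent of $V(\epsilon)$ on a log-log scale. Restricting to the box $[-1,1]^2$, the grid resolution, the range of $\epsilon$, and all function names are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def sublevel_area(slice_width, eps, n_grid=1_000_000):
    """Area of {L < eps} inside [-1, 1]^2 via a midpoint rule over w2 in (0, 1).

    slice_width(w2, eps) gives the length of {w1 in [-1, 1] : L(w1, w2) < eps}.
    Both losses are symmetric in w2, hence the factor of 2.
    """
    w2 = (np.arange(n_grid) + 0.5) / n_grid
    return 2.0 * np.mean(slice_width(w2, eps))

def width_regular(w2, eps):
    # L = w1^2 + w2^2 < eps  <=>  |w1| < sqrt(eps - w2^2), when eps > w2^2
    half = np.sqrt(np.clip(eps - w2 ** 2, 0.0, None))
    return np.minimum(2.0, 2.0 * half)

def width_singular(w2, eps):
    # L = w1^2 * w2^4 < eps  <=>  |w1| < sqrt(eps) / w2^2, capped by the box
    return np.minimum(2.0, 2.0 * np.sqrt(eps) / w2 ** 2)

def fitted_exponent(slice_width, eps_values):
    """Slope of log V(eps) against log eps, i.e. the empirical scaling exponent."""
    areas = np.array([sublevel_area(slice_width, e) for e in eps_values])
    return np.polyfit(np.log(eps_values), np.log(areas), 1)[0]

eps_values = np.logspace(-8, -4, 5)
slope_regular = fitted_exponent(width_regular, eps_values)
slope_singular = fitted_exponent(width_singular, eps_values)
print(f"regular loss exponent:  {slope_regular:.3f}")
print(f"singular loss exponent: {slope_singular:.3f}")
```

The fitted exponents should come out near 1 and near 1/4 respectively. The singular case sits slightly below 1/4 at finite $\epsilon$ because the box boundary contributes a lower-order correction that vanishes as $\epsilon \rightarrow 0$.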

Theorems & Definitions (4)

Definition 1: The Local Learning Coefficient (LLC), ${\lambda({w^*})}$
Definition 2: Local Learning Coefficient (LLC) estimator
Definition 3
Theorem 1: DLN learning coefficient, aoyagi2024
