The Local Learning Coefficient: A Singularity-Aware Complexity Measure

Edmund Lau, Zach Furman, George Wang, Daniel Murfet, Susan Wei

TL;DR

The paper argues that traditional, parameter-count-based complexity measures are inadequate for deep neural networks because of singularities in the loss landscape. It introduces the Local Learning Coefficient (LLC), a singularity-aware complexity measure grounded in Singular Learning Theory, and derives a local volume scaling $V(\epsilon) \propto \epsilon^{\lambda(w^*)} (-\log \epsilon)^{m(w^*)-1}$ that governs the geometry near a minimum $w^*$, as well as a local free energy expansion $F_n(B_\gamma(w^*)) = n L_n(w^*) + \lambda(w^*) \log n + O_p(\log \log n)$. The authors develop a scalable LLC estimator that combines a practical surrogate using a localizing prior with an SGLD-based sampling scheme, and validate it empirically on deep linear networks with up to $10^8$ parameters, ResNet models on CIFAR-10, and transformer language models, showing that training heuristics measurably modulate the LLC. They demonstrate that the LLC decreases with stronger implicit regularization and can reveal the effective simplicity of trained networks, offering a principled framework for reconciling deep networks' apparent complexity with parsimony. The work provides a data-distribution-sensitive lens for model selection and for understanding training dynamics, with the potential to illuminate phase transitions and emergent abilities in large-scale models.

Abstract

The Local Learning Coefficient (LLC) is introduced as a novel complexity measure for deep neural networks (DNNs). Recognizing the limitations of traditional complexity measures, the LLC leverages Singular Learning Theory (SLT), which has long recognized the significance of singularities in the loss landscape geometry. This paper provides an extensive exploration of the LLC's theoretical underpinnings, offering both a clear definition and intuitive insights into its application. Moreover, we propose a new scalable estimator for the LLC, which is then effectively applied across diverse architectures including deep linear networks up to 100M parameters, ResNet image models, and transformer language models. Empirical evidence suggests that the LLC provides valuable insights into how training heuristics might influence the effective complexity of DNNs. Ultimately, the LLC emerges as a crucial tool for reconciling the apparent contradiction between deep learning's complexity and the principle of parsimony.

Paper Structure (46 sections, 1 theorem, 51 equations, 14 figures, 1 table, 1 algorithm)

Key Result

Theorem 1

Let $r := \mathrm{rank}\left(W_M^* \dots W_2^* W_1^*\right)$ be the rank of the linear transformation implemented by the true DLN, $f(x, w)$, and set $\Delta_j := H_j - r$ for $j = 0, \dots, M$. There exists a subset $\Sigma \subset \{0, 1, \dots, M\}$ of indices, $\Sigma = \{\sigma_1, \dots\}$ ... Assuming that the DLN truth-model pair $\left(q(x, y), p(x, y \mid w)\right)$ satisfies the relatively ...

Figures (14)

  • Figure 1: Impact of SGD learning rate (top), batch size (middle) and momentum (bottom) when training ResNet18 on CIFAR10. We plot the LLC estimate (left), test accuracy (middle) and train loss (right) across training time. As the strength of the implicit regularization increases --- through higher learning rate, lower batch size and higher momentum --- LLC decreases (the network gets "simpler") and test accuracy increases. Even though most training losses collapse to zero, the LLC can discern the implicit regularization pressure applied by various training heuristics.
  • Figure 2: The LLC ${\lambda({w^*})}$ measures volume scaling around a local minimum of the loss. Top: the area (or "2D-volume") under the loss $L(w_1, w_2) = {w_1}^2 + {w_2}^2$ scales as $V(\epsilon) \propto \epsilon$ around ${w^*} = (0,0)$, so ${\lambda({w^*})} = 1$. Bottom: the area under the loss $L(w_1, w_2) = {w_1}^2 {w_2}^4$ scales as $V(\epsilon) \propto \epsilon^{1/4}$ around ${w^*} = (0,0)$, so ${\lambda({w^*})} = \frac{1}{4}$. Lower ${\lambda({w^*})}$ means more volume: as we approach the minimum ($\epsilon \rightarrow 0$), there is much more volume available for the bottom loss function compared to the top for any given small tolerance $\epsilon$. Reproduced with permission from hoogland2023.
  • Figure 3: Estimated LLC against true learning coefficient; model dimension shown in color. On the left, we evaluate the LLC estimator at a global minimum, ${w^*}$, of the population loss. On the right, we evaluate the LLC estimator at a minimum, $\hat{w}_n^*$, found by SGD. Fortunately, we do not see an adverse effect of using the training data twice, a minor concern we had raised at the end of Section \ref{sec:llc_estimator}. The estimated LLCs accurately measure the learning coefficient $\lambda$ up to 100 million parameters in deep linear networks, as compared to known theoretical values (dashed line). See Figure \ref{fig:lambdahat-vs-lambda linear scale} for linear-scale plots.
  • Figure H.1: Sample loss trace (blue, left axis) and MALA acceptance probability (red, right axis) over DLN training trajectories at different model sizes.
  • Figure I.1: The top graph shows $\lambda$ decreasing as the DLN becomes deeper, even though model parameter count increases with number of layers. The bottom graph shows the true multiplicities, $m$. Since regular models can only have $m = 1$, the graph shows that most of these randomly generated DLNs are singular.
  • ...and 9 more figures
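
The volume scaling described in the Figure 2 caption can be checked numerically. The following is a minimal sketch, not taken from the paper: it assumes the parameter region is the box $[-1, 1]^2$, estimates the sublevel-set area $V(\epsilon)$ by Monte Carlo, and fits the slope of $\log V(\epsilon)$ against $\log \epsilon$. The function name `volume_exponent`, the sample count, and the $\epsilon$ ranges are illustrative choices; the ranges differ per loss only so that each sublevel set contains enough samples.

```python
import numpy as np

def volume_exponent(loss, eps_grid, n_samples=2_000_000, seed=0):
    """Fit the slope of log V(eps) against log eps, where V(eps) is the
    area of {w in [-1, 1]^2 : loss(w) < eps}, estimated by Monte Carlo."""
    rng = np.random.default_rng(seed)
    # Uniform samples in the (assumed) parameter box [-1, 1]^2.
    w = rng.uniform(-1.0, 1.0, size=(n_samples, 2))
    values = loss(w[:, 0], w[:, 1])
    box_area = 4.0
    # Area of each sublevel set = box area * fraction of samples inside it.
    volumes = np.array([box_area * np.mean(values < eps) for eps in eps_grid])
    # The slope of the log-log fit is the estimated scaling exponent.
    slope, _intercept = np.polyfit(np.log(eps_grid), np.log(volumes), 1)
    return slope

def regular_loss(w1, w2):
    return w1**2 + w2**2        # top panel of Figure 2

def singular_loss(w1, w2):
    return w1**2 * w2**4        # bottom panel of Figure 2

lam_regular = volume_exponent(regular_loss, np.logspace(-3, -1, 9))
lam_singular = volume_exponent(singular_loss, np.logspace(-8, -5, 9))
print(f"regular: {lam_regular:.3f}, singular: {lam_singular:.3f}")
```

The two fitted slopes should land close to the caption's values of $1$ and $\frac{1}{4}$. The singular estimate is expected to sit slightly below $\frac{1}{4}$ at any finite $\epsilon$, because the box boundary contributes a lower-order term that only vanishes as $\epsilon \rightarrow 0$.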

Theorems & Definitions (4)

  • Definition 1: The Local Learning Coefficient (LLC), ${\lambda({w^*})}$
  • Definition 2: Local Learning Coefficient (LLC) estimator
  • Definition 3
  • Theorem 1: DLN learning coefficient, aoyagi2024
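
Definition 2 is listed above without its formula. As a rough illustration of the idea described in the TL;DR (a localizing prior plus gradient-based sampling), the sketch below assumes the estimator has the form $\hat{\lambda}(w^*) = n\beta \left( \mathbb{E}[L_n(w)] - L_n(w^*) \right)$ with $\beta = 1/\log n$, where the expectation is over a tempered posterior localized at $w^*$ by a Gaussian term of strength $\gamma$. That form, the function name `estimate_llc`, and all hyperparameter values are assumptions for illustration. The sampler is a plain unadjusted Langevin scheme with full gradients, standing in for minibatch SGLD, and the toy loss is a regular quadratic in four dimensions, for which the learning coefficient is $d/2 = 2$.

```python
import numpy as np

def estimate_llc(loss, grad, w_star, n, gamma=1.0, step=1e-4,
                 n_steps=5_000, burn_in=500, n_chains=8, seed=0):
    """Langevin-based LLC estimate at w_star (illustrative sketch).

    loss(w) and grad(w) take an array of shape (n_chains, d) and return
    per-chain losses of shape (n_chains,) and gradients of shape (n_chains, d).
    """
    rng = np.random.default_rng(seed)
    beta = 1.0 / np.log(n)                  # assumed inverse temperature
    w_star = np.asarray(w_star, dtype=float)
    w = np.tile(w_star, (n_chains, 1))      # every chain starts at w_star
    running = []
    for t in range(n_steps):
        # Gradient of the log localized tempered posterior:
        # -n * beta * grad L(w) - gamma * (w - w_star).
        drift = -n * beta * grad(w) - gamma * (w - w_star)
        noise = rng.standard_normal(w.shape)
        w = w + 0.5 * step * drift + np.sqrt(step) * noise
        if t >= burn_in:
            running.append(loss(w).mean())  # average loss across chains
    expected_loss = float(np.mean(running))
    loss_at_w_star = float(loss(w_star[None, :])[0])
    return n * beta * (expected_loss - loss_at_w_star)

def quad_loss(w):
    return 0.5 * np.sum(w**2, axis=-1)      # regular toy loss, minimum at 0

def quad_grad(w):
    return w

llc_hat = estimate_llc(quad_loss, quad_grad, w_star=np.zeros(4), n=10_000)
print(f"estimated LLC: {llc_hat:.3f} (regular value d/2 = 2)")
```

For this Gaussian toy case the estimate should come out close to 2, with a small upward bias from the finite step size. On singular losses the same sketch would need more care with step size and chain length.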