Understanding Artificial Neural Network's Behavior from Neuron Activation Perspective
Yizhou Zhang, Yang Sui
TL;DR
This work introduces an activation-centric probabilistic framework that models neuron activations as a stochastic process to explain neural scaling laws. By deriving a closed-form expression for the growth of the number of active (working) neurons, $K(D) \approx N\left(1-\left(\frac{bN}{cD+bN}\right)^b\right)$, and showing that working-neuron activations follow a power law $P(D_i=k)\propto k^{-\alpha}$ with $0<\alpha<1$, the authors account for generalization under over-parameterization, a phase transition in generalization when dataset size is plotted on a log axis, and a power-law decay of loss with dataset size. They further predict not-yet-observed phenomena such as compression dynamics via the noise-to-signal ratio $\frac{N-K(D)}{N}=\left(\frac{bN}{cD+bN}\right)^b$, implying that larger models are more compressible and that compressibility depends on both $N$ and $D$ in specific power-law forms. The framework yields testable predictions for parameter efficiency and pruning, connecting empirical neural-scaling observations to a concrete activation-based theory that can guide future experiments and model design.
Abstract
This paper explores the behavior of deep neural networks (DNNs) through the lens of neuron activation dynamics. We propose a probabilistic framework that analyzes a model's neuron activation patterns as a stochastic process, yielding theoretical insights into neural scaling laws, such as over-parameterization and the power-law decay of loss with respect to dataset size. By deriving key mathematical relationships, we show that the number of activated neurons grows as $N(1-(\frac{bN}{D+bN})^b)$, and that neuron activations should follow a power-law distribution. Based on these two mathematical results, we demonstrate how DNNs maintain generalization capabilities even under over-parameterization, and we elucidate the phase transition observed in loss curves when dataset size is plotted on a logarithmic axis (i.e., the data magnitude increases linearly). Moreover, by combining these two phenomena with the power-law distribution of neuron activation, we derive the power-law decay of the neural network's loss as the dataset size increases. Furthermore, our analysis bridges the gap between empirical observations and theoretical underpinnings, offering experimentally testable predictions regarding parameter efficiency and model compressibility. These findings provide a foundation for understanding neural network scaling and present new directions for optimizing DNN performance.
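The growth law for activated neurons can be evaluated numerically to get a feel for its shape. The following is a minimal sketch, not the authors' code: it assumes $N$ is the total number of neurons and $D$ the dataset size, the function names are hypothetical, and the values chosen for $N$ and $b$ are purely illustrative. The optional constant `c` corresponds to the data-scaling factor in the TL;DR form of the formula; `c = 1` recovers the form stated in the abstract.

```python
def inactive_fraction(N, D, b, c=1.0):
    """Fraction of neurons not yet activated: (bN / (cD + bN))^b.

    N: total number of neurons (assumed interpretation)
    D: dataset size (assumed interpretation)
    b, c: constants of the growth law; c = 1 gives the abstract's form
    """
    return (b * N / (c * D + b * N)) ** b


def working_neurons(N, D, b, c=1.0):
    """Number of activated neurons: K(D) = N * (1 - (bN / (cD + bN))^b)."""
    return N * (1.0 - inactive_fraction(N, D, b, c))


# Illustrative values only; they are not taken from the paper.
N, b = 10_000, 0.5
for D in (10**3, 10**4, 10**5, 10**6):
    K = working_neurons(N, D, b)
    print(f"D={D:>8}  K(D)={K:10.1f}  inactive fraction={inactive_fraction(N, D, b):.4f}")
```

Under these assumptions, $K(0)=0$ and $K(D)$ increases monotonically toward $N$ as $D$ grows, while the inactive fraction (the noise-to-signal ratio in the TL;DR) decays toward zero.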
