The Nonlinearity Coefficient - Predicting Generalization in Deep Neural Networks
George Philipp, Jaime G. Carbonell
TL;DR
The paper introduces the Nonlinearity Coefficient (NLC), a gradient-based metric that captures network nonlinearity and serves as a strong pre-training predictor of generalization across architectures. It formalizes the NLC and its theoretical ties to Jacobians and input/output covariances, and validates the approach with a large-scale empirical study spanning 750 random architectures on three datasets, identifying a narrow optimal NLC range (approximately 1–3) for generalization. The work demonstrates the NLC’s robustness to confounders like input scaling, bias, and width, and highlights the roles of output bias and skip connections in shaping performance. It also connects activation nonlinearity to NLC through activation-specific measures and discusses practical implications for architecture search, while outlining future extensions to CNNs and broader questions such as robustness and efficiency.
Abstract
For a long time, designing neural architectures that exhibit high performance was considered a dark art that required expert hand-tuning. One of the few well-known guidelines for architecture design is the avoidance of exploding gradients, though even this guideline has remained relatively vague and circumstantial. We introduce the nonlinearity coefficient (NLC), a measurement of the complexity of the function computed by a neural network that is based on the magnitude of the gradient. Via an extensive empirical study, we show that the NLC is a powerful predictor of test error and that attaining a right-sized NLC is essential for optimal performance. The NLC exhibits a range of intriguing and important properties. It is closely tied to the amount of information gained from computing a single network gradient. It is tied to the error incurred when replacing the nonlinearity operations in the network with linear operations. It is not susceptible to the confounders of multiplicative scaling, additive bias and layer width. It is stable from layer to layer. Hence, we argue that the NLC is the first robust predictor of overfitting in deep networks.
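As a rough illustration of what a gradient-based nonlinearity measure of this kind could look like, the sketch below estimates a quantity of the form sqrt(E_x[Tr(J(x) Cov_x J(x)^T)] / Tr(Cov_f)), where J(x) is the Jacobian of the network at input x, Cov_x is the input covariance and Cov_f is the output covariance. This formula is an assumption on our part, based on the summary's mention of Jacobians and input/output covariances; the text above does not spell out the definition. The helper names `numerical_jacobian` and `nlc_estimate` are hypothetical, and the finite-difference Jacobian is a convenience for small examples, not the authors' procedure.

```python
import numpy as np

def numerical_jacobian(f, x, eps=1e-5):
    """Central-difference Jacobian of f at x, shape (d_out, d_in)."""
    x = np.asarray(x, dtype=float)
    cols = []
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        cols.append((f(x + step) - f(x - step)) / (2 * eps))
    return np.stack(cols, axis=1)

def nlc_estimate(f, X):
    """Estimate sqrt(E_x[Tr(J Cov_x J^T)] / Tr(Cov_f)) from samples X.

    Assumed form of the NLC (see the lead-in); f maps a 1-D input
    vector to a 1-D output vector, X has shape (n_samples, d_in).
    """
    X = np.asarray(X, dtype=float)
    # Input covariance, normalised by n so it matches var() below.
    cov_x = np.atleast_2d(np.cov(X, rowvar=False, bias=True))
    outputs = np.stack([f(x) for x in X])
    # Trace of the output covariance = sum of per-coordinate variances.
    trace_cov_f = outputs.var(axis=0).sum()
    # Average gradient magnitude, measured relative to the input spread.
    numerator = np.mean(
        [np.trace(J @ cov_x @ J.T)
         for J in (numerical_jacobian(f, x) for x in X)]
    )
    return float(np.sqrt(numerator / trace_cov_f))

# Usage: a linear map should give a value close to 1 under this
# definition, since its Jacobian is constant and Cov_f = A Cov_x A^T.
rng = np.random.default_rng(0)
A = rng.normal(size=(3, 4))
X = rng.normal(size=(200, 4))
linear_value = nlc_estimate(lambda x: A @ x, X)
```

Under this assumed definition, multiplying the inputs or outputs by a constant rescales the numerator and denominator by the same factor, which is consistent with the abstract's claim that the NLC is not confounded by multiplicative scaling.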
