Neural Redshift: Random Networks are not Random Functions
Damien Teney, Armand Nicolicioiu, Valentin Hartmann, Ehsan Abbasnejad
TL;DR
This paper proposes Neural Redshift, an architecture-driven account of generalization in neural networks that does not rely solely on the implicit biases of gradient-based training. By sampling random-weight networks and evaluating them on input grids, the authors quantify three notions of simplicity (low frequency, low polynomial order, and high compressibility) using Fourier analysis, polynomial decomposition, and Lempel-Ziv compression. They show that common architectural choices, notably ReLU activations, residual connections, and layer normalization, bias networks toward simple functions, and that this bias persists in trained models and shapes generalization. Conversely, modulating weight magnitudes or using alternative activations shifts the bias toward higher complexity, which enables learning of more complex tasks. The findings extend to transformers, suggesting that their inductive bias toward compressible sequences originates from their building blocks and is not universal across architectures. Altogether, the work offers a gradient-free lens on the success of deep learning and suggests practical routes to controlling the solutions selected by training.
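The measurement procedure summarized above (sample random weights, evaluate on a grid, score the result in the Fourier domain) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names `sample_mlp`, `mean_frequency`, and `average_complexity`, the bias-free Gaussian initialization, the 64x64 grid over [-1, 1]^2, the Hann window, and the choice of a sine activation as the "alternative" are all hypothetical choices made here.

```python
import numpy as np

def sample_mlp(rng, activation, weight_scale=1.0, depth=3, width=64):
    """Return a random-weight, bias-free MLP mapping (n, 2) inputs to n scalars.

    Assumption: Gaussian weights with std = weight_scale / sqrt(fan_in).
    """
    sizes = [2] + [width] * depth + [1]
    weights = [
        rng.normal(0.0, weight_scale / np.sqrt(fan_in), size=(fan_in, fan_out))
        for fan_in, fan_out in zip(sizes[:-1], sizes[1:])
    ]

    def forward(x):
        h = x
        for W in weights[:-1]:
            h = activation(h @ W)
        return (h @ weights[-1])[:, 0]

    return forward

def mean_frequency(grid_values):
    """Amplitude-weighted mean radial frequency of a square 2D grid of outputs.

    A Hann window (a choice made for this sketch) limits boundary leakage,
    so smooth functions score low and oscillatory ones score high.
    """
    n = grid_values.shape[0]
    window = np.outer(np.hanning(n), np.hanning(n))
    spectrum = np.abs(np.fft.fft2((grid_values - grid_values.mean()) * window))
    k = np.fft.fftfreq(n, d=1.0 / n)  # integer cycles per grid side
    radius = np.hypot(*np.meshgrid(k, k, indexing="ij"))
    total = spectrum.sum()
    return float((spectrum * radius).sum() / total) if total > 0 else 0.0

# Regular grid of 2D inputs on which every sampled network is evaluated.
n = 64
axis = np.linspace(-1.0, 1.0, n)
xx, yy = np.meshgrid(axis, axis, indexing="ij")
grid = np.stack([xx.ravel(), yy.ravel()], axis=1)

def average_complexity(activation, weight_scale, trials=20, seed=0):
    """Average the frequency score over several independently sampled networks."""
    rng = np.random.default_rng(seed)
    scores = [
        mean_frequency(sample_mlp(rng, activation, weight_scale)(grid).reshape(n, n))
        for _ in range(trials)
    ]
    return float(np.mean(scores))

def relu(z):
    return np.maximum(z, 0.0)

relu_score = average_complexity(relu, weight_scale=8.0)
sine_score = average_complexity(np.sin, weight_scale=8.0)
print(f"ReLU mean frequency: {relu_score:.2f}")
print(f"sine mean frequency: {sine_score:.2f}")
```

Both activations are sampled with the same weight scale, so any gap between the two scores comes from the activation alone. The absolute values depend on the grid size and the window, so only the relative ordering is meaningful.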
Abstract
Our understanding of the generalization capabilities of neural networks (NNs) is still incomplete. Prevailing explanations are based on implicit biases of gradient descent (GD), but they cannot account for the capabilities of models obtained with gradient-free methods, nor for the simplicity bias recently observed in untrained networks. This paper seeks other sources of generalization in NNs. Findings. To understand the inductive biases provided by architectures independently of GD, we examine untrained, random-weight networks. Even simple MLPs show strong inductive biases: uniform sampling in weight space yields a very biased distribution of functions in terms of complexity. But contrary to common wisdom, NNs do not have an inherent "simplicity bias". This property depends on components such as ReLUs, residual connections, and layer normalizations. Alternative architectures can be built with a bias for any level of complexity. Transformers also inherit all these properties from their building blocks. Implications. We provide a fresh explanation for the success of deep learning that is independent of gradient-based training. It points to promising avenues for controlling the solutions implemented by trained models.
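The claim that the complexity bias is tunable can be probed with the compressibility measure mentioned in the TL;DR. The sketch below is a hedged illustration, not the paper's code: it assumes an LZ78-style phrase count as the Lempel-Ziv proxy, a bias-free sine-activation MLP on a 1D grid, 8 quantization levels, and three arbitrary weight scales. The names `lz_phrase_count`, `random_sine_mlp_outputs`, and `discretize` are hypothetical.

```python
import numpy as np

def lz_phrase_count(symbols):
    """Number of phrases in an LZ78-style parse (fewer phrases = more compressible)."""
    seen = set()
    phrase = ()
    count = 0
    for s in symbols:
        phrase = phrase + (s,)
        if phrase not in seen:  # first time this phrase appears: close it
            seen.add(phrase)
            count += 1
            phrase = ()
    return count + (1 if phrase else 0)  # count a trailing unfinished phrase

def random_sine_mlp_outputs(rng, x, weight_scale, depth=3, width=64):
    """Evaluate one random, bias-free sine MLP on the 1D inputs x.

    Assumption: Gaussian weights with std = weight_scale / sqrt(fan_in).
    """
    h = x[:, None]
    fan_in = 1
    for _ in range(depth):
        W = rng.normal(0.0, weight_scale / np.sqrt(fan_in), size=(fan_in, width))
        h = np.sin(h @ W)
        fan_in = width
    w_out = rng.normal(0.0, 1.0 / np.sqrt(width), size=width)
    return h @ w_out

def discretize(values, levels=8):
    """Quantize outputs into `levels` equal-width bins over their own range."""
    lo, hi = values.min(), values.max()
    if hi == lo:
        return [0] * len(values)
    bins = np.floor((values - lo) / (hi - lo) * levels).astype(int)
    return np.clip(bins, 0, levels - 1).tolist()

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 1024)

# Average the phrase count over 10 random networks per weight scale.
mean_counts = {}
for scale in (0.5, 2.0, 8.0):
    counts = [
        lz_phrase_count(discretize(random_sine_mlp_outputs(rng, x, scale)))
        for _ in range(10)
    ]
    mean_counts[scale] = float(np.mean(counts))
    print(f"weight scale {scale}: mean LZ phrase count {mean_counts[scale]:.1f}")
```

In this setup, a larger weight scale makes the sampled functions more oscillatory, and therefore less compressible once quantized. A ReLU network without biases is positively homogeneous in its weights, so rescaling them would only rescale its output and leave the shape of the function unchanged; this is why the sketch uses a non-ReLU activation to expose the effect.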
