Neural Redshift: Random Networks are not Random Functions

Damien Teney, Armand Nicolicioiu, Valentin Hartmann, Ehsan Abbasnejad

TL;DR

This paper introduces Neural Redshift, an architecture-centered account of generalization in neural networks that does not rely solely on the implicit biases of gradient-based training. The authors sample random-weight networks, evaluate them on grids of input points, and measure the simplicity of the resulting functions in three ways: low frequency (Fourier analysis), low polynomial order (polynomial decomposition), and high compressibility (Lempel-Ziv compression). Common architectural choices, notably ReLU activations, residual connections, and layer normalization, bias networks toward simple functions, and this bias persists in trained models, where it shapes generalization. Conversely, changing weight magnitudes or using other activation functions shifts the bias toward higher complexity, which enables learning more complex tasks. The findings extend to transformers: their bias toward compressible sequences appears to originate from their building blocks and is not universal across architectures. Overall, the work offers an explanation of deep learning's success that is independent of gradient-based training and suggests practical ways to control the solutions that training selects.

Abstract

Our understanding of the generalization capabilities of neural networks (NNs) is still incomplete. Prevailing explanations are based on implicit biases of gradient descent (GD) but they cannot account for the capabilities of models from gradient-free methods nor the simplicity bias recently observed in untrained networks. This paper seeks other sources of generalization in NNs.

Findings. To understand the inductive biases provided by architectures independently from GD, we examine untrained, random-weight networks. Even simple MLPs show strong inductive biases: uniform sampling in weight space yields a very biased distribution of functions in terms of complexity. But unlike common wisdom, NNs do not have an inherent "simplicity bias". This property depends on components such as ReLUs, residual connections, and layer normalizations. Alternative architectures can be built with a bias for any level of complexity. Transformers also inherit all these properties from their building blocks.

Implications. We provide a fresh explanation for the success of deep learning independent from gradient-based training. It points at promising avenues for controlling the solutions implemented by trained models.
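
The procedure described above (sample random weights, evaluate the network on a grid of inputs, score the complexity of the resulting function) can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' code: the initialization scheme, the grid extent, and the `fourier_complexity` proxy (an amplitude-weighted mean frequency) are assumptions, and all function names are hypothetical.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def random_mlp(rng, in_dim=2, width=64, depth=3, weight_scale=1.0, activation=relu):
    """Build an MLP with freshly sampled weights and biases; return its forward function."""
    dims = [in_dim] + [width] * depth + [1]
    params = []
    for fan_in, fan_out in zip(dims[:-1], dims[1:]):
        # Assumed init: uniform with a fan-in dependent bound, times weight_scale.
        bound = weight_scale * np.sqrt(6.0 / fan_in)
        W = rng.uniform(-bound, bound, size=(fan_in, fan_out))
        b = rng.uniform(-bound, bound, size=fan_out)
        params.append((W, b))

    def forward(x):
        h = x
        for W, b in params[:-1]:
            h = activation(h @ W + b)
        W, b = params[-1]
        return (h @ W + b)[:, 0]  # scalar output per input point

    return forward

def evaluate_on_grid(f, resolution=64, extent=1.0):
    """Evaluate f on a regular 2D grid; the result can be viewed as a grayscale image."""
    axis = np.linspace(-extent, extent, resolution)
    xx, yy = np.meshgrid(axis, axis)
    points = np.stack([xx.ravel(), yy.ravel()], axis=1)
    return f(points).reshape(resolution, resolution)

def fourier_complexity(image):
    """Assumed proxy: mean frequency magnitude, weighted by Fourier amplitude."""
    image = image - image.mean()  # drop the constant (zero-frequency) component
    amplitude = np.abs(np.fft.fft2(image))
    fy = np.fft.fftfreq(image.shape[0])
    fx = np.fft.fftfreq(image.shape[1])
    radius = np.sqrt(fy[:, None] ** 2 + fx[None, :] ** 2)
    total = amplitude.sum()
    return float((amplitude * radius).sum() / total) if total > 0 else 0.0

# Compare the average score of random networks for two activation functions.
rng = np.random.default_rng(0)
for name, act in [("relu", relu), ("tanh", np.tanh)]:
    scores = [
        fourier_complexity(evaluate_on_grid(random_mlp(rng, activation=act, weight_scale=4.0)))
        for _ in range(20)
    ]
    print(name, np.mean(scores))
```

Varying `depth` and `weight_scale` in the loop gives a rough analogue of a sweep over architectures. The grid is not periodic, so the discrete Fourier transform introduces some spectral leakage; the scores are best read as relative comparisons rather than absolute values.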

Paper Structure (72 sections, 21 figures, 1 table)

Figures (21)

  • Figure 1: We examine the complexity of the functions implemented by various MLP architectures. We find that many of their generalization capabilities can be understood independently from the optimization, training objective, scaling, or even data distribution. For example, ReLU and GELU networks (left) overwhelmingly represent low-frequency functions for any network depth or weight magnitude. Other activations lack this property.
  • Figure 2: Our methodology to characterize the inductive biases of an architecture. We evaluate a network with random weights/biases on a grid of points. This yields a representation of the function implemented by the network, shown here as a grayscale image for a 2D input. We then characterize this function using three measures of complexity.
  • Figure 3: Comparison of functions implemented by random MLPs (2D input, 3 hidden layers). ReLU and TanH architectures are biased towards different functions despite their universal approximation capabilities. ReLU architectures have the unique property of maintaining their simplicity bias regardless of weight magnitude.
  • Figure 4: Heatmaps of the average Fourier complexity of functions implemented by random-weight networks. Each heatmap corresponds to an activation function and each cell (within a heatmap) corresponds to a depth (heatmap columns) and weight magnitude (heatmap rows). We also show grayscale images of functions implemented by networks of an architecture corresponding to every other heatmap cell.
  • Figure 5: The complexity of random models (Y axis) generally increases with the magnitude of weights and activations (X axis). The sensitivity, however, differs greatly across activation functions. It also increases with multiplicative interactions (i.e. gating), decreases with residual connections, and is essentially absent with layer normalization.
  • ...and 16 more figures
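
The captions of Figures 3 and 5 contrast how strongly different activations react to weight magnitude. One way to build intuition, in the special case of a network without bias terms (an assumption made here for simplicity, not necessarily the setting of the paper), is that ReLU is positively homogeneous: relu(a·z) = a·relu(z) for a > 0. Scaling every weight matrix by a then only rescales the output, so the shape of the function is unchanged, whereas TanH saturates and the function changes. The sketch below checks this numerically; the helper names are hypothetical.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def sample_weights(rng, dims):
    """Gaussian weights with 1/sqrt(fan_in) scaling; no bias terms (assumption)."""
    return [rng.standard_normal((a, b)) / np.sqrt(a) for a, b in zip(dims[:-1], dims[1:])]

def forward(weights, x, activation, scale=1.0):
    """Bias-free MLP in which every weight matrix is multiplied by `scale`."""
    h = x
    for W in weights[:-1]:
        h = activation(h @ (scale * W))
    return (h @ (scale * weights[-1]))[:, 0]

def standardize(y):
    """Remove offset and amplitude so that only the shape of the function remains."""
    return (y - y.mean()) / y.std()

rng = np.random.default_rng(0)
weights = sample_weights(rng, [2, 64, 64, 64, 1])

# Regular 2D grid of input points in [-1, 1]^2.
axis = np.linspace(-1.0, 1.0, 32)
grid = np.stack(np.meshgrid(axis, axis), axis=-1).reshape(-1, 2)

def shape_change(activation, scale):
    """Largest pointwise difference between the standardized functions at scale 1 and `scale`."""
    base = standardize(forward(weights, grid, activation, 1.0))
    scaled = standardize(forward(weights, grid, activation, scale))
    return float(np.max(np.abs(base - scaled)))

relu_change = shape_change(relu, 4.0)     # numerically negligible: output is only rescaled
tanh_change = shape_change(np.tanh, 4.0)  # clearly nonzero: saturation alters the function
print("relu:", relu_change, "tanh:", tanh_change)
```

With bias terms the homogeneity argument no longer holds exactly, so this sketch illustrates only one contributing mechanism, not the full empirical picture shown in the figures.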