When Are Bias-Free ReLU Networks Effectively Linear Networks?

Yedi Zhang; Andrew Saxe; Peter E. Latham

TL;DR

This work analyzes the consequences of removing bias terms from ReLU networks. It shows that two-layer bias-free (leaky) ReLU networks have restricted expressivity and that, under symmetry conditions on the data, their learning dynamics match those of linear networks up to a time rescaling. It also establishes a depth-based expressivity gap: deep bias-free ReLU networks can implement certain odd nonlinear functions that their two-layer counterparts cannot. The authors derive analytical time-course solutions for certain two-layer cases, which supports transferring insights from linear networks to these ReLU architectures. In deep networks, they observe low-rank weight structures resembling those of deep linear networks when training starts from small initial weights on symmetric data. Together, the results clarify when bias-free ReLU networks behave like linear networks, when linear theory gives accurate predictions, and why biases are needed to learn certain nonlinear tasks.
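
The depth-based gap can be illustrated with a small hand-built network. The sketch below is not the construction used in the paper; the weights `W1`, `W2`, `w3` and the function `deep_net` are hypothetical, chosen only so that a bias-free ReLU network with two hidden layers computes $\mathrm{relu}(\mathrm{relu}(x_1) - \mathrm{relu}(x_2)) - \mathrm{relu}(\mathrm{relu}(-x_1) - \mathrm{relu}(-x_2))$, which is odd but not linear.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

# Hypothetical hand-picked weights (an illustration, not the paper's example).
# First layer computes relu(x1), relu(x2), relu(-x1), relu(-x2).
W1 = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
# Second layer computes relu(relu(x1) - relu(x2)) and relu(relu(-x1) - relu(-x2)).
W2 = np.array([[1.0, -1.0, 0.0, 0.0], [0.0, 0.0, 1.0, -1.0]])
# Output layer takes the difference of the two second-layer units.
w3 = np.array([1.0, -1.0])

def deep_net(X):
    """Bias-free ReLU network with two hidden layers; X has shape (n, 2)."""
    h1 = relu(X @ W1.T)
    h2 = relu(h1 @ W2.T)
    return h2 @ w3

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))

# Oddness: f(-x) = -f(x) on random inputs.
is_odd = np.allclose(deep_net(-X), -deep_net(X))

# Nonlinearity: a linear map would satisfy f(e1 + e2) = f(e1) + f(e2).
e1 = np.array([[1.0, 0.0]])
e2 = np.array([[0.0, 1.0]])
additivity_gap = deep_net(e1 + e2) - (deep_net(e1) + deep_net(e2))

print("odd:", is_odd)
print("additivity gap:", additivity_gap[0])
```

Because this function is odd and nonlinear, the two-layer result summarized above implies that no two-layer bias-free ReLU network can represent it.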

Abstract

We investigate the implications of removing bias in ReLU networks regarding their expressivity and learning dynamics. We first show that two-layer bias-free ReLU networks have limited expressivity: the only odd function two-layer bias-free ReLU networks can express is a linear one. We then show that, under symmetry conditions on the data, these networks have the same learning dynamics as linear networks. This enables us to give analytical time-course solutions to certain two-layer bias-free (leaky) ReLU networks outside the lazy learning regime. While deep bias-free ReLU networks are more expressive than their two-layer counterparts, they still share a number of similarities with deep linear networks. These similarities enable us to leverage insights from linear networks to understand certain ReLU networks. Overall, our results show that some properties previously established for bias-free ReLU networks arise due to equivalence to linear networks.
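
The first claim can be checked numerically. The following is a minimal sketch, assuming a two-layer network of the form $f({\bm{x}}) = {\bm{w}}_2^\top \sigma({\bm{W}}_1 {\bm{x}})$ with leaky ReLU slope $\alpha$; the names `two_layer_net`, `W1`, `w2` and the random sizes are illustrative choices, not taken from the paper. It splits the output into odd and even parts and compares the odd part with the linear map $\frac{1+\alpha}{2} {\bm{w}}_2^\top {\bm{W}}_1 {\bm{x}}$.

```python
import numpy as np

def leaky_relu(z, alpha=0.0):
    # alpha = 0 gives ReLU, alpha = 1 gives the identity (a linear network).
    return np.where(z > 0, z, alpha * z)

def two_layer_net(X, W1, w2, alpha=0.0):
    """Bias-free two-layer network f(x) = w2^T sigma(W1 x); X has shape (n, d)."""
    return leaky_relu(X @ W1.T, alpha) @ w2

rng = np.random.default_rng(0)
d, hidden, n, alpha = 3, 16, 200, 0.1  # illustrative sizes and slope
W1 = rng.normal(size=(hidden, d))
w2 = rng.normal(size=hidden)
X = rng.normal(size=(n, d))

f_pos = two_layer_net(X, W1, w2, alpha)
f_neg = two_layer_net(-X, W1, w2, alpha)
odd_part = 0.5 * (f_pos - f_neg)
even_part = 0.5 * (f_pos + f_neg)

# The odd part should coincide with a single linear map applied to x.
linear_map = 0.5 * (1 + alpha) * (w2 @ W1)
odd_is_linear = np.allclose(odd_part, X @ linear_map)

# The even part should be a weighted sum of absolute values |w_i^T x|.
even_pred = 0.5 * (1 - alpha) * (np.abs(X @ W1.T) @ w2)
even_matches = np.allclose(even_part, even_pred)

print("odd part is linear:", odd_is_linear)
print("even part matches:", even_matches)
```

If the network is to represent an odd function, the even part must vanish, leaving only the linear term.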

Paper Structure (31 sections, 13 theorems, 97 equations, 11 figures)

Introduction
Related Work
Preliminaries
Two-Layer Bias-Free (Leaky) ReLU and Linear Networks
Deep Networks
Network Expressivity
Two-Layer Bias-Free (Leaky) ReLU Networks
Deep Bias-Free (Leaky) ReLU Networks
Learning Dynamics in Two-Layer Bias-Free ReLU Networks
Symmetric Datasets
Orthogonal and XOR Datasets
Learning Dynamics in Deep Bias-Free ReLU Networks
Discussion
Implication of Bias Removal
Perturbed Symmetric Dataset
...and 16 more sections

Key Result

Theorem 1

The set of functions that can be expressed by two-layer bias-free (leaky) ReLU networks is a subset of the set of functions of the form $f({\bm{x}}) = h({\bm{x}}) + g({\bm{x}})$, where $h({\bm{x}})$ is linear and $g({\bm{x}})$ is a positively homogeneous even function, meaning $g({\bm{x}}) = g(-{\bm{x}})$ ...
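
A short derivation sketch of where this decomposition comes from, assuming the network is written as $f({\bm{x}}) = \sum_i a_i \sigma({\bm{w}}_i^\top {\bm{x}})$ with leaky ReLU slope $\alpha \in [0, 1]$. The symbols $a_i$ and ${\bm{w}}_i$ are notation introduced here for illustration and may differ from the paper's.

```latex
% Leaky ReLU split into an odd (linear) part and an even part:
\sigma(z) = \max(z, \alpha z) = \frac{1+\alpha}{2}\, z + \frac{1-\alpha}{2}\, |z| .

% Substituting into the two-layer bias-free network:
f({\bm{x}}) = \sum_i a_i \sigma({\bm{w}}_i^\top {\bm{x}})
  = \underbrace{\frac{1+\alpha}{2} \sum_i a_i {\bm{w}}_i^\top {\bm{x}}}_{h({\bm{x}})}
  + \underbrace{\frac{1-\alpha}{2} \sum_i a_i \left| {\bm{w}}_i^\top {\bm{x}} \right|}_{g({\bm{x}})} .

% h is linear; g is even and positively homogeneous:
g(-{\bm{x}}) = g({\bm{x}}), \qquad g(c\,{\bm{x}}) = c\, g({\bm{x}}) \quad \text{for all } c \ge 0 .
```

For an odd target the even term $g$ must vanish, which leaves only the linear term $h$.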

Figures (11)

Figure 1: The expressivity of two-layer and deep ReLU networks with and without bias. The networks are trained with logistic loss until the loss stops decreasing. The empty circles are data points with $+1$ labels; short lines are data points with $-1$ labels. The network output is plotted in color. (a) The fan dataset is odd, homogeneous, and satisfies \ref{ass:sym-data}. Two-layer bias-free ReLU networks cannot express it. (b) The circle dataset is not homogeneous. Two-layer and deep bias-free ReLU networks cannot express it. Experimental details are provided in \ref{supp:implementation}.
Figure 2: Function $g({\bm{x}})$ defined in \ref{eq:depth-separation} is plotted with color.
Figure 3: Two-layer bias-free (leaky) ReLU networks can evolve like a linear network. (a) Loss curves with different leaky ReLU parameters $\alpha$ (note that $\alpha=1$ is a linear network). The simulations match the theoretical solutions in \ref{eq:L2-wt-sol}. The loss converges to the global minimum, which is not zero due to the restricted expressivity of two-layer bias-free ReLU networks. (b) The simulated loss curves are plotted against a rescaled time axis; they collapse to one curve, demonstrating that the (leaky) ReLU and linear networks implement the same linear function, as in \ref{eq:f=f_lin}. The error, defined as $\left\| \sqrt{\frac{\alpha+1}{2}} {\bm{W}}\left(\frac{2}{\alpha+1} t \right) - {\bm{W}}^\mathrm{lin}(t) \right\| / \left\| {\bm{W}}^\mathrm{lin}(t) \right\|$, is less than $0.3\%$, demonstrating that the weights in the (leaky) ReLU network are close to the weights in the linear network, as in \ref{eq:W=W_lin}. The errors are not exactly zero because the initial weights are sampled from a zero-mean Gaussian distribution, which does not satisfy \ref{ass:L2-rank1} but better reflects practical initialization schemes. Experimental details are provided in \ref{supp:implementation}.
Figure 4: Two-layer bias-free ReLU networks can evolve like multiple independent linear networks. (a) An orthogonal input dataset used in \cite{boursier22orthogonal}. The $+$ and $-$ signs represent data points with $+1$ and $-1$ labels respectively. Their different colors are used only to distinguish the loss curves. The black arrows are the first-layer weights at convergence. (b) The loss curve of the ReLU network overlaps with two linear networks trained on each of the two data points respectively. (c) An XOR-like dataset. (d) The loss curve of the ReLU network overlaps with four linear networks trained on each of the four data points separately. Details: We use summed (instead of averaged) square loss for this figure. The initial losses are vertically aligned to help illustrate the overlap. More details are in \ref{supp:implementation}.
Figure 5: Low-rank weights in deep linear and ReLU bias-free networks. A three-layer linear network and a three-layer ReLU network are trained on the same dataset starting from the same small random weights. The dataset has a linear target function and an even empirical input data distribution. We plot the weights when the loss has approached zero. ${\bm{W}}_1$, ${\bm{W}}_3$, and positive elements in ${\bm{W}}_2$ have approximately the same structure in the linear and ReLU networks. Elements of ${\bm{W}}_2$ that are negative in the linear network are approximately zero in the ReLU network. The neurons are permuted for visualization. Simulations with deeper networks are presented in \ref{fig:deep-lowrank}. Experimental details are provided in \ref{supp:implementation}.
...and 6 more figures

Theorems & Definitions (26)

Theorem 1
proof
Corollary 2
Remark 4
Lemma 5
Lemma 6
Theorem 8
Corollary 9
Corollary 10
Conjecture 11
...and 16 more
