When Are Bias-Free ReLU Networks Effectively Linear Networks?
Yedi Zhang, Andrew Saxe, Peter E. Latham
TL;DR
This work analyzes the consequences of removing bias terms from ReLU networks. Two-layer bias-free (leaky) ReLU networks have restricted expressivity: the only odd functions they can express are linear. Under symmetry conditions on the data, their learning dynamics match those of linear networks up to a rescaling of time, which lets the authors derive analytical time-course solutions for certain two-layer cases and supports transferring linear-network insights to ReLU architectures. Depth creates an expressivity gap: deep bias-free ReLU networks can implement certain odd nonlinear functions that their two-layer counterparts cannot. Even so, deep bias-free networks are observed to develop low-rank weight structures resembling those of deep linear networks, which offers a principled explanation for ReLU behavior seen in regimes of small initialization and symmetric data. Overall, the results identify when bias-free models behave like linear networks, clarify the regimes in which linear theory gives accurate predictions, and indicate why biases matter for learning nonlinear tasks.
Abstract
We investigate the implications of removing bias in ReLU networks regarding their expressivity and learning dynamics. We first show that two-layer bias-free ReLU networks have limited expressivity: the only odd function two-layer bias-free ReLU networks can express is a linear one. We then show that, under symmetry conditions on the data, these networks have the same learning dynamics as linear networks. This enables us to give analytical time-course solutions to certain two-layer bias-free (leaky) ReLU networks outside the lazy learning regime. While deep bias-free ReLU networks are more expressive than their two-layer counterparts, they still share a number of similarities with deep linear networks. These similarities enable us to leverage insights from linear networks to understand certain ReLU networks. Overall, our results show that some properties previously established for bias-free ReLU networks arise due to equivalence to linear networks.
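The expressivity claim can be checked numerically. The sketch below is a minimal illustration, not the authors' code. It relies on the identity σ(z) − σ(−z) = (1 + α)z for a leaky ReLU σ with negative slope α (α = 0 gives the plain ReLU), so the odd part of a two-layer bias-free network is the linear map ((1 + α)/2) w2ᵀW1 x. The names `W1`, `w2`, `alpha`, `net`, and `odd_part`, along with the dimensions and random seed, are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, h = 3, 16                    # input dimension and hidden width (illustrative)
alpha = 0.1                     # leaky-ReLU negative slope; alpha = 0 is plain ReLU
W1 = rng.normal(size=(h, d))    # first-layer weights, no bias
w2 = rng.normal(size=h)         # second-layer weights, no bias

def leaky_relu(z, slope):
    """Elementwise leaky ReLU: z where z > 0, slope * z elsewhere."""
    return np.where(z > 0, z, slope * z)

def net(x):
    """Two-layer bias-free (leaky) ReLU network with scalar output."""
    return w2 @ leaky_relu(W1 @ x, alpha)

def odd_part(x):
    """Odd part of the network function: (f(x) - f(-x)) / 2."""
    return 0.5 * (net(x) - net(-x))

# Predicted odd part: a linear map with weights ((1 + alpha) / 2) * w2^T W1.
linear_map = 0.5 * (1.0 + alpha) * (w2 @ W1)

x = rng.normal(size=d)
gap = abs(odd_part(x) - linear_map @ x)
assert np.isclose(odd_part(x), linear_map @ x)
print("gap between odd part and linear prediction:", gap)
```

Because the odd part is always linear in x, any odd function the network represents must itself be linear, whatever the hidden width. The remaining even part, proportional to Σ_h w2[h]·|W1[h]·x|, is the only source of nonlinearity.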
