Compelling ReLU Networks to Exhibit Exponentially Many Linear Regions at Initialization and During Training
Max Milkert, David Hyde, Forrest Laine
TL;DR
This work tackles the challenge that random initialization often underutilizes the potential expressivity of deep ReLU networks by proposing a triangle-wave-based reparameterization that fixes $2^d$ linear regions at depth $d$ and preserves them during training. A differentiability-guided pretraining procedure enforces a scaling law $s_{i+1}=s_i(1-a_{i+1})a_{i+2}$, enabling accurate learning of convex univariate functions and extending to non-convex and multidimensional cases. The approach yields orders-of-magnitude improvements in one-dimensional function approximation and can serve as a drop-in replacement for dense layers in larger architectures, with preliminary results on image classification showing early gains but comparable final accuracy. This suggests potential for more parameter-efficient, better-conditioned networks, while signaling the need for further work to scale the method to dense, high-dimensional architectures and broader real-world tasks.
Abstract
In a neural network with ReLU activations, the number of piecewise linear regions in the output can grow exponentially with depth. However, this is highly unlikely to happen when the initial parameters are sampled randomly, which in turn often leads to the use of networks that are unnecessarily large. To address this problem, we introduce a novel parameterization of the network that restricts its weights so that a depth-$d$ network produces exactly $2^d$ linear regions at initialization and maintains those regions throughout training under the parameterization. This approach allows us to learn approximations of convex, one-dimensional functions that are several orders of magnitude more accurate than their randomly initialized counterparts. We further demonstrate a preliminary extension of our construction to multidimensional and non-convex functions, allowing the technique to replace traditional dense layers in various architectures.
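To make the $2^d$ claim concrete, the following is a minimal sketch of the textbook construction behind it: a symmetric triangle (tent) map on $[0, 1]$ written with two ReLU units and composed $d$ times. It is not the parameterization proposed in the paper, which allows trainable triangle shapes and a scaling law between layers. The function names `tent`, `compose_tent`, and `count_linear_regions` are hypothetical, and the peak is fixed at $0.5$ purely for illustration.

```python
def relu(x):
    """Standard ReLU activation on a scalar."""
    return x if x > 0.0 else 0.0


def tent(x):
    """Symmetric triangle wave on [0, 1] built from two ReLU units.

    Assumption for illustration: the peak sits at x = 0.5. The map rises
    with slope 2 on [0, 0.5] and falls with slope -2 on [0.5, 1].
    """
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)


def compose_tent(x, depth):
    """Apply the tent map `depth` times, mimicking a depth-`depth` ReLU network."""
    for _ in range(depth):
        x = tent(x)
    return x


def count_linear_regions(depth):
    """Count linear pieces of the composed map on [0, 1].

    The grid is dyadic with spacing 2^-(depth+2), so every breakpoint of
    the composed map lies on a grid point and the float arithmetic is exact.
    """
    n = 2 ** (depth + 2)
    xs = [k / n for k in range(n + 1)]
    ys = [compose_tent(x, depth) for x in xs]
    # Finite-difference slope on each grid interval.
    slopes = [(ys[k + 1] - ys[k]) * n for k in range(n)]
    # A new linear region starts wherever the slope changes.
    changes = sum(1 for k in range(n - 1) if slopes[k] != slopes[k + 1])
    return changes + 1


if __name__ == "__main__":
    for d in range(1, 7):
        print(d, count_linear_regions(d))
```

Each composition folds the interval onto itself once more, so the number of linear pieces doubles per layer while the layer width stays at two units. A randomly initialized network of the same size gives no such guarantee, which is the gap the parameterization in this work is designed to close.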
