Compelling ReLU Networks to Exhibit Exponentially Many Linear Regions at Initialization and During Training

Max Milkert; David Hyde; Forrest Laine

TL;DR

This work tackles the challenge that random initialization often underutilizes the potential expressivity of deep ReLU networks by proposing a triangle-wave-based reparameterization that fixes $2^d$ linear regions at depth $d$ and preserves them during training. A differentiability-guided pretraining procedure enforces a scaling law $s_{i+1}=s_i(1-a_{i+1})a_{i+2}$, enabling accurate learning of convex univariate functions and extending to non-convex and multidimensional cases. The approach yields orders-of-magnitude improvements in one-dimensional function approximation and can serve as a drop-in replacement for dense layers in larger architectures, with preliminary results on image classification showing early gains but comparable final accuracy. This suggests potential for more parameter-efficient, better-conditioned networks, while signaling the need for further work to scale the method to dense, high-dimensional architectures and broader real-world tasks.

Abstract

In a neural network with ReLU activations, the number of piecewise linear regions in the output can grow exponentially with depth. However, this is highly unlikely to happen when the initial parameters are sampled randomly, which often leads to the use of networks that are unnecessarily large. To address this problem, we introduce a novel parameterization of the network that restricts its weights so that a depth-$d$ network produces exactly $2^d$ linear regions at initialization and maintains those regions throughout training under the parameterization. This approach allows us to learn approximations of convex, one-dimensional functions that are several orders of magnitude more accurate than their randomly initialized counterparts. We further demonstrate a preliminary extension of our construction to multidimensional and non-convex functions, allowing the technique to replace traditional dense layers in various architectures.
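To make the $2^d$ claim concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes each layer computes a single tent (triangle) map on $[0,1]$ with peak value 1 at a location $a$, written with two ReLUs, and it counts regions by counting the monotone pieces of the composed wave on a fine grid. The function names (`triangle`, `composed_wave`, `count_linear_regions`) and the grid size are illustrative choices rather than anything taken from the paper.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def triangle(x, a=0.5):
    """Tent map on [0, 1] with peak value 1 at x = a, built from two ReLUs."""
    # For x <= a this is x / a; for x >= a it equals (1 - x) / (1 - a).
    return relu(x) / a - relu(x - a) * (1.0 / a + 1.0 / (1.0 - a))

def composed_wave(x, peaks):
    """Apply one triangle per layer; peaks[k] is the peak location of layer k."""
    y = x
    for a in peaks:
        y = triangle(y, a)
    return y

def count_linear_regions(peaks, grid_size=2**16 + 1):
    """Count the pieces of the composed wave on [0, 1].

    Assumption: for composed tent maps every linear piece is monotone and
    neighbouring pieces alternate between rising and falling, so counting
    sign changes of the finite-difference slope counts the pieces.
    """
    x = np.linspace(0.0, 1.0, grid_size)
    signs = np.sign(np.diff(composed_wave(x, peaks)))
    signs = signs[signs != 0]  # drop flat grid cells that straddle a peak
    return 1 + int(np.count_nonzero(signs[1:] != signs[:-1]))

if __name__ == "__main__":
    for depth in range(1, 7):
        print(depth, count_linear_regions([0.5] * depth))
```

Each extra layer doubles the count while adding only a constant number of neurons, which is the exponential growth the abstract refers to. The count stays at $2^d$ when the peaks are moved away from $1/2$, provided the grid is fine enough to resolve the narrowest piece.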

Paper Structure (27 sections, 11 theorems, 34 equations, 17 figures, 3 tables, 1 algorithm)

Introduction
Related Work
Function Approximation
Neural Network Initialization
Initialization and Pretraining Construction
Pretraining and Overall Algorithm
Differentiable Model Output
One-Dimensional Experiments
Experimental Setup
Numerical Results
Extension to Non-Convex Functions and Higher Dimensions
Preliminary Image Classification Results
Concluding Remarks
Algorithm and Theory
Initialization and Pretraining Algorithm
...and 12 more sections

Key Result

Theorem

$F'(x) = \sum_{i=0}^\infty s_i W'_i(x)$ and is continuous on $[0,1]$ only if the scaling coefficients $s_i$ are selected based on the triangle peaks $a_i$ according to:

$$s_{i+1} = s_i (1 - a_{i+1})\, a_{i+2}$$
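As a sanity check on the scaling law $s_{i+1} = s_i(1-a_{i+1})a_{i+2}$ quoted in the TL;DR, the sketch below applies it in the symmetric case where every peak is $a_i = 1/2$, so each coefficient is one quarter of the previous one. Under the assumptions made here (the sum is indexed from $i = 1$, $W_i$ is the $i$-fold composition of tent maps, and the free first coefficient is set to $s_1 = 1/4$, none of which is confirmed by this page), the partial sums approach $x - x^2$, a classical identity for composed tent maps. The helper names are hypothetical, and the non-symmetric case is not exercised.

```python
import numpy as np

def triangle(x, a=0.5):
    """Tent map on [0, 1] with peak value 1 at x = a, built from two ReLUs."""
    return np.maximum(x, 0.0) / a - np.maximum(x - a, 0.0) * (1.0 / a + 1.0 / (1.0 - a))

def scales_from_peaks(peaks, s_first):
    """Apply s_{i+1} = s_i * (1 - a_{i+1}) * a_{i+2}.

    Assumed indexing: peaks[k] plays the role of a_{k+1}, and s_first is s_1,
    a free overall scale. With n peaks this yields s_1, ..., s_{n-1}.
    """
    scales = [s_first]
    for i in range(1, len(peaks) - 1):
        a_next, a_after = peaks[i], peaks[i + 1]  # a_{i+1} and a_{i+2}
        scales.append(scales[-1] * (1.0 - a_next) * a_after)
    return scales

def partial_sum(x, peaks, scales):
    """Sum of s_i * W_i(x), where W_i composes the first i triangles."""
    total = np.zeros_like(x)
    wave = x
    for a, s in zip(peaks, scales):  # zip stops at the last available scale
        wave = triangle(wave, a)
        total += s * wave
    return total

n_terms = 6
peaks = [0.5] * (n_terms + 1)            # symmetric triangles only
scales = scales_from_peaks(peaks, 0.25)  # each term is 1/4 of the previous
x = np.linspace(0.0, 1.0, 2**12 + 1)
err = np.max(np.abs(partial_sum(x, peaks, scales) - (x - x**2)))
print(scales, err)
```

In this symmetric setting the partial sum with $n$ terms is the piecewise linear interpolant of $x - x^2$ on $2^n$ equal intervals, so the maximum error is $4^{-(n+1)}$ and the slope jumps at the breakpoints shrink as more terms are added.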

Figures (17)

Figure 1: (Top to bottom) Composed triangle waves; using collections of the above function to approximate $x^2$; derivatives of the above approximations.
Figure 2: On the top left is a network representation of a triangle function. The top right shows that triangle function as a hidden layer of a network. The one-dimensional input and output of a triangle function is converted into shared weights. A full pretraining network is assembled on the bottom.
Figure 3: Each colored line shows the output signal of a neuron with respect to the input to the network. Colors match the corresponding neurons in Figure 2.
Figure 4: Standard Kaiming initialization/gradient descent vs. pretraining with differentiability enforced. Using more linear regions allows the curve to better predict the test points.
Figure 5: Neuron outputs of a default (Kaiming-initialized) network (left) versus a pretrained variant of our network (right). Notice that the first two layers of the default network introduce no linear regions: none of the lines cross zero. Any infinitesimal adjustment to the slopes or biases of the lines would not make such an intersection occur. Therefore, the number of linear regions generated by the network cannot be a local property, and we can expect gradient-based optimization to struggle to maximize the linear region count. Our method uses this non-locality to our advantage. The pretraining phase finds a low-loss solution where $2^d$ linear regions are generated, which guarantees a neighborhood of parameter space where $2^d$ regions can be maintained while training in the standard parameterization.
...and 12 more figures

Theorems & Definitions (21)

Theorem
Lemma
proof
Lemma
proof
Lemma
proof
Lemma
proof
proof
...and 11 more
