Why ReLU? A Bit-Model Dichotomy for Deep Network Training

Ilan Doron-Arad; Elchanan Mossel

TL;DR

This work analyzes the theoretical complexity of ERM under a realistic bit-level model, where network parameters and inputs are constrained to be rational numbers with polynomially bounded bit-lengths, and demonstrates that finite-precision constraints are not merely implementation details but fundamental determinants of learnability.

Abstract

Theoretical analyses of Empirical Risk Minimization (ERM) are standardly framed within the Real-RAM model of computation. In this setting, training even simple neural networks is known to be $\exists \mathbb{R}$-complete, where $\exists \mathbb{R}$ is a complexity class believed to be harder than NP that characterizes the difficulty of solving systems of polynomial inequalities over the real numbers. However, this algebraic framework diverges from the reality of digital computation on finite-precision hardware. In this work, we analyze the theoretical complexity of ERM under a realistic bit-level model ($\mathsf{ERM}_{\text{bit}}$), where network parameters and inputs are constrained to be rational numbers with polynomially bounded bit-lengths. Under this model, we reveal a sharp dichotomy in tractability governed by the network's activation function. We prove that for deep networks with any polynomial activation with rational coefficients and degree at least $2$, the bit-complexity of training is severe: deciding $\mathsf{ERM}_{\text{bit}}$ is $\#P$-hard, hence believed to be strictly harder than NP-complete problems. Furthermore, we show that determining the sign of a single partial derivative of the empirical loss function is intractable (unlikely to be in BPP), and deciding a specific bit of the gradient is $\#P$-hard. This provides a complexity-theoretic perspective on the phenomenon of exploding and vanishing gradients. In contrast, we show that for piecewise-linear activations such as ReLU, the precision requirements remain manageable: $\mathsf{ERM}_{\text{bit}}$ is in NP (in fact, NP-complete), and standard backpropagation runs in polynomial time. Our results demonstrate that finite-precision constraints are not merely implementation details but fundamental determinants of learnability.
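The contrast in precision requirements can be seen in a toy setting. The sketch below is not the paper's construction: it assumes a hypothetical width-1 chain $z \leftarrow \sigma(w z)$ with an illustrative input $3/2$ and weight $3$, and tracks the exact rational value and its derivative with respect to the input. The helper names (`bit_length`, `chain_bits`) are introduced here for illustration only.

```python
from fractions import Fraction


def bit_length(q: Fraction) -> int:
    """Bits needed to write q as numerator/denominator (sign ignored)."""
    return abs(q.numerator).bit_length() + q.denominator.bit_length()


def square(u: Fraction) -> Fraction:
    return u * u


def d_square(u: Fraction) -> Fraction:
    return 2 * u


def relu(u: Fraction) -> Fraction:
    return u if u > 0 else Fraction(0)


def d_relu(u: Fraction) -> Fraction:
    return Fraction(1) if u > 0 else Fraction(0)


def chain_bits(activation, derivative, depth, x=Fraction(3, 2), w=Fraction(3)):
    """Exact bit-lengths of the value and of d(value)/dx after each layer.

    Assumed toy model: a width-1 chain z <- activation(w * z).
    """
    z, dz = x, Fraction(1)
    value_bits, grad_bits = [], []
    for _ in range(depth):
        pre = w * z                     # pre-activation of this layer
        dz = derivative(pre) * w * dz   # chain rule through this layer
        z = activation(pre)
        value_bits.append(bit_length(z))
        grad_bits.append(bit_length(dz))
    return value_bits, grad_bits


sq_val, sq_grad = chain_bits(square, d_square, depth=8)
re_val, re_grad = chain_bits(relu, d_relu, depth=8)
print("x^2  value bits:", sq_val)
print("x^2  grad bits: ", sq_grad)
print("ReLU value bits:", re_val)
print("ReLU grad bits: ", re_grad)
```

In this toy chain the squaring activation roughly doubles the bit-length of the exact value at every layer, while the ReLU chain adds only a bounded number of bits per layer. This is the kind of gap between exponential and polynomial bit-lengths in depth that the dichotomy is about, though the hardness results themselves rest on reductions rather than on this simple growth argument.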

Paper Structure (40 sections, 17 theorems, 112 equations, 3 figures, 1 table)

Introduction
Main Results
Discussion
Loss Functions
Dimensions and Regularization
Number representation of deep networks in practice.
Smooth activations in practice.
Quantization and finite-precision training.
Related Work
Real-RAM computation and $\exists\mathbb{R}$-hardness.
Bit-model hardness and SLP.
Hardness and algorithms for ERM.
Exploding/vanishing gradients.
Organization.
Preliminaries
...and 25 more sections

Key Result

Theorem 1.1

Deciding $\mathsf{ERM}_{\text{bit}}$ for deep networks with activations in $\{\sigma,\textnormal{id}\}$ is $\textnormal{\#P}$-hard (under polynomial-time Turing reductions), where $\sigma \in \mathbb{Q}[T]$ is any non-linear polynomial activation with degree $\ge 2$. This result holds for $0/1$ loss

Figures (3)

Figure 1: Illustration of depth-dependent gradient growth for polynomial vs. ReLU activations. We plot $\log_{10}\|\nabla_{W_1} L\|_2$ versus depth for randomly initialized MLPs with $\sigma(x)=x^2$ or ReLU, where in both cases the weights are scaled by a factor of 3 at every layer. Gradients increase with depth for both activations, but much faster for the polynomial one, qualitatively reflecting our bit-complexity separation. This experiment is illustrative rather than a formal bound.
Figure 2: Proof roadmap: SLP-based hardness is embedded into deep networks via a fixed polynomial activation $\sigma$, yielding conditional lower bounds for $\mathsf{ERM}_{\text{bit}}$ and Backprop-Sign; piecewise-linear activations admit NP verification and polynomial-time exact backprop in the bit model.
Figure 3: Multiplication gadget using a fixed polynomial activation $\sigma\in\mathbb{Q}[T]$. For each integer shift $j=0,\dots,{\mu}$, the network forms three affine combinations $x+y+j$, $x+j$, and $y+j$, applies $\sigma$, and then takes a fixed linear combination with coefficients $\lambda_j\in\mathbb{Q}$ (indicated by the "linear combination" node).
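The multiplication gadget of Figure 3 is easiest to see in the special case $\sigma(T)=T^2$, where a single shift $j=0$ suffices because $\sigma(x+y)-\sigma(x)-\sigma(y)=2xy$. The sketch below assumes this special case only; the coefficients $(\tfrac12,-\tfrac12,-\tfrac12)$ and the function name `product_gadget` are illustrative and are not taken from the paper, whose general construction uses several shifts and coefficients $\lambda_j$ that depend on $\sigma$.

```python
from fractions import Fraction


def sigma(t: Fraction) -> Fraction:
    """Assumed activation for this sketch: sigma(T) = T^2."""
    return t * t


def product_gadget(x: Fraction, y: Fraction) -> Fraction:
    """Compute x * y using only affine maps and the fixed activation sigma.

    Three affine combinations (x + y, x, y) are passed through sigma and then
    combined linearly with rational coefficients 1/2, -1/2, -1/2.
    """
    half = Fraction(1, 2)
    return half * sigma(x + y) - half * sigma(x) - half * sigma(y)


x, y = Fraction(3, 7), Fraction(-5, 2)
print(product_gadget(x, y), x * y)
```

Because every step is an affine map followed by $\sigma$ and a final linear combination, the gadget fits inside a network with activations in $\{\sigma,\textnormal{id}\}$, which is what allows arithmetic on rationals to be embedded into the network.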

Theorems & Definitions (44)

Theorem 1.1
Corollary 1.2
Theorem 1.3
Theorem 1.4
Remark 1.5: Bit-bounded activations: polynomial-time evaluation and backpropagation
Theorem 1.6
Definition 2.1: $(a,b)$-promise $\mathsf{ERM}_{\text{bit}}$
Definition 2.2
Theorem 2.4
Theorem 2.6
...and 34 more
