A Novel Explanation Against Linear Neural Networks

Anish Lakkapragada

TL;DR

The paper adds a second argument against neural networks without activation functions, beyond the standard one that such networks can only represent linear functions: linear neural networks (LNNs) can underperform linear regression because their excess parameters make optimization harder. Through an optimization analysis of a two-layer LNN and synthetic-data experiments across varying noise levels, the author argues that random initialization and interdependent parameters hinder convergence to the true linear mapping, causing higher training and testing error as depth increases. In the experiments, the optimal linear solution is attained only by linear regression or very shallow LNNs, while deeper architectures consistently converge to suboptimal solutions, a practical drawback of increasing parameter count without activation nonlinearities. The work suggests a fundamental limitation of LNNs even on simple linear data and emphasizes the importance of activation-driven nonlinearity for scalable learning, with potential implications for model selection and initialization strategies in practice.
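
To make the "interdependent parameters" point concrete, here is a minimal worked sketch. It assumes a scalar two-layer LNN with weights $w_1, w_2$, no biases, and squared error on a single sample $(x, y)$. These symbols and this parameterization are illustrative assumptions and may differ from the paper's own analysis.

```latex
\begin{align}
  \text{Linear regression:}\quad
    & \hat{y} = w x, &
    & \frac{\partial L}{\partial w} = 2\,(w x - y)\,x \\
  \text{Two-layer LNN:}\quad
    & \hat{y} = w_2 w_1 x, &
    & \frac{\partial L}{\partial w_1} = 2\,(w_2 w_1 x - y)\,w_2 x, \quad
      \frac{\partial L}{\partial w_2} = 2\,(w_2 w_1 x - y)\,w_1 x
\end{align}
% with L = (\hat{y} - y)^2 in both cases
```

In the LNN case each gradient is scaled by the other weight, so the size of an update to $w_1$ depends on the current value of $w_2$ and vice versa. At $w_1 = w_2 = 0$ both gradients vanish even when the loss is not minimal, which cannot happen for the single weight of linear regression unless $x = 0$.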

Abstract

Linear regression and neural networks are widely used to model data. Neural networks distinguish themselves from linear regression through their use of activation functions, which enable them to model nonlinear functions. The standard argument for these activation functions is that without them, neural networks can only model a line. In this paper, we propose a novel explanation for the impracticality of neural networks without activation functions, or linear neural networks (LNNs): they actually reduce both training and testing performance. Having more parameters makes LNNs harder to optimize, and thus they require more training iterations than linear regression to even potentially converge to the optimal solution. We prove this hypothesis through an analysis of the optimization of an LNN and rigorous testing comparing the performance of LNNs and linear regression on synthetic, noisy datasets.
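
The comparison described in the abstract can be sketched in a few lines of NumPy. This is an illustrative setup, not the paper's experimental code: the dataset size, noise level, hidden width, learning rate, step count, and initialization scale are all assumptions, as are the function names `make_data`, `train_linear_regression`, and `train_two_layer_lnn`. Both models are trained by full-batch gradient descent on the same synthetic, noisy linear data, and their training error is compared with the closed-form least-squares optimum.

```python
import numpy as np


def make_data(n=200, d=5, noise=0.1, seed=0):
    """Synthetic linear data y = X w_true + Gaussian noise (assumed setup)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))
    w_true = rng.normal(size=(d, 1))
    y = X @ w_true + noise * rng.normal(size=(n, 1))
    return X, y, w_true


def mse(X, y, W):
    """Mean squared error of the linear map W on (X, y)."""
    return float(np.mean((X @ W - y) ** 2))


def train_linear_regression(X, y, steps=500, lr=0.02, seed=1):
    """Gradient descent on a single weight matrix W of shape (d, 1)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.normal(scale=0.1, size=(d, 1))
    for _ in range(steps):
        grad = (2.0 / n) * X.T @ (X @ W - y)
        W -= lr * grad
    return W


def train_two_layer_lnn(X, y, hidden=5, steps=500, lr=0.02, seed=1):
    """Gradient descent on y_hat = X W1 W2 with no activation function."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(scale=0.1, size=(d, hidden))
    W2 = rng.normal(scale=0.1, size=(hidden, 1))
    for _ in range(steps):
        residual = X @ W1 @ W2 - y
        # Each layer's gradient contains the other layer's weights.
        grad_W1 = (2.0 / n) * X.T @ residual @ W2.T
        grad_W2 = (2.0 / n) * (X @ W1).T @ residual
        W1 -= lr * grad_W1
        W2 -= lr * grad_W2
    # The network collapses to one effective linear map of shape (d, 1).
    return W1 @ W2


if __name__ == "__main__":
    X, y, _ = make_data()
    W_ols = np.linalg.lstsq(X, y, rcond=None)[0]  # least-squares optimum
    print("least squares    :", mse(X, y, W_ols))
    print("linear regression:", mse(X, y, train_linear_regression(X, y)))
    print("two-layer LNN    :", mse(X, y, train_two_layer_lnn(X, y)))
```

Because the LNN's effective map `W1 @ W2` is itself linear, its training error can never fall below the least-squares value. The printed numbers show how close each gradient-trained model gets to that optimum within the same iteration budget, and varying `steps`, `hidden`, or `noise` gives a rough feel for the effect the paper studies.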

Paper Structure

This paper contains 7 sections, 3 equations, 2 figures, and 1 table.

Figures (2)

  • Figure 1: Plot of the average optimal parameter deviation $D$ for each model across all 100 training runs.
  • Figure 2: Trendlines of testing MSE as the number of LNN parameters and layers increases, across all noise levels.