Failures of Gradient-Based Deep Learning
Shai Shalev-Shwartz, Ohad Shamir, Shaked Shammah
TL;DR
The paper analyzes systematic failures of gradient-based deep learning on deliberately simple problems, showing that the gradient can be uninformative about the target, that its signal-to-noise ratio can vanish, and that optimization can be ill-conditioned. It provides theoretical bounds, based on statistical-query arguments and on gradient variance, that explain why gradient-based methods fail on parity and linear-periodic tasks and on certain end-to-end (as opposed to decomposed) training setups. Through targeted experiments (parity learning, k-rectangles, and piecewise-linear encodings), it demonstrates that architectural choices (notably convolutional designs) and explicit conditioning dramatically improve optimization speed, while non-gradient update rules can overcome the limitations posed by flat activations. The work proposes three practical remedies (decomposition, conditioning, and forward-only updates) that make learning more robust on these problem classes, with implications for when and how to apply gradient-based methods in practice.
Abstract
In recent years, Deep Learning has become the go-to solution for a broad range of applications, often outperforming the previous state of the art. However, it is important, for both theoreticians and practitioners, to gain a deeper understanding of the difficulties and limitations associated with common approaches and algorithms. We describe four types of simple problems, for which the gradient-based algorithms commonly used in deep learning either fail or suffer from significant difficulties. We illustrate the failures through practical experiments, and provide theoretical insights explaining their source, and how they might be remedied.
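As a rough illustration of the "uninformative gradient" failure on parity learning, the following sketch enumerates every parity function on d = 8 bits and measures how much the gradient of a squared loss changes as the target parity changes. The tanh-of-linear model, the dimension, the random weights, and the function name `parity_gradient_variance` are assumptions made for illustration and are not the paper's experimental setup. Because parities are orthonormal under the uniform distribution, Bessel's inequality implies that this variance is at most the mean squared gradient norm of the model divided by the number of parities, so it shrinks exponentially with d.

```python
import itertools
import numpy as np


def parity_gradient_variance(d=8, seed=0):
    """Variance, over all parity targets, of the loss gradient at a fixed w.

    Illustrative assumptions: inputs uniform on {-1, +1}^d, model
    p_w(x) = tanh(w . x), loss F_v(w) = 0.5 * E_x[(p_w(x) - h_v(x))^2].
    """
    rng = np.random.default_rng(seed)

    # All 2^d inputs and all 2^d parity index vectors v in {0, 1}^d.
    X = np.array(list(itertools.product([-1.0, 1.0], repeat=d)))
    V = np.array(list(itertools.product([0, 1], repeat=d)))

    # H[j, i] = h_v(x) = product of the coordinates of x selected by v,
    # computed as (-1)^(number of selected coordinates equal to -1).
    neg = (X < 0).astype(int)
    H = 1.0 - 2.0 * ((V @ neg.T) % 2)

    # A fixed, arbitrary weight vector at which gradients are evaluated.
    w = rng.normal(size=d) / np.sqrt(d)
    p = np.tanh(X @ w)                  # model outputs, shape (2^d,)
    J = (1.0 - p ** 2)[:, None] * X     # d p_w(x) / d w, shape (2^d, d)

    # Exact gradient of F_v at w for every target parity v.
    grads = ((p[None, :] - H) @ J) / X.shape[0]

    # Spread of the gradient across targets, and the quantity bounding it.
    var = np.mean(np.sum((grads - grads.mean(axis=0)) ** 2, axis=1))
    g2 = np.mean(np.sum(J ** 2, axis=1))
    return var, g2, H.shape[0]


var, g2, n_h = parity_gradient_variance()
print(f"number of parities:         {n_h}")
print(f"gradient variance:          {var:.3e}")
print(f"bound G^2 / (num parities): {g2 / n_h:.3e}")
```

The gradient at w is almost the same vector whichever parity is the target, so a gradient step reveals very little about which parity generated the labels. Raising `d` makes the effect stronger, at the cost of enumerating 2^d inputs and 2^d targets.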
