Memorization With Neural Nets: Going Beyond the Worst Case

Sjoerd Dirksen, Patrick Finke, Martin Genzel

TL;DR

This paper introduces a simple randomized algorithm that, given a fixed finite data set with two classes, with high probability constructs an interpolating three-layer neural network in polynomial time and obtains guarantees that are independent of the number of samples and hence move beyond worst-case memorization capacity bounds.

Abstract

In practice, deep neural networks are often able to easily interpolate their training data. To understand this phenomenon, many works have aimed to quantify the memorization capacity of a neural network architecture: the largest number of points such that the architecture can interpolate any placement of these points with any assignment of labels. For real-world data, however, one intuitively expects the presence of a benign structure so that interpolation already occurs at a smaller network size than suggested by memorization capacity. In this paper, we investigate interpolation by adopting an instance-specific viewpoint. We introduce a simple randomized algorithm that, given a fixed finite data set with two classes, with high probability constructs an interpolating three-layer neural network in polynomial time. The required number of parameters is linked to geometric properties of the two classes and their mutual arrangement. As a result, we obtain guarantees that are independent of the number of samples and hence move beyond worst-case memorization capacity bounds. We verify our theoretical result with numerical experiments and additionally investigate the effectiveness of the algorithm on MNIST and CIFAR-10.

Paper Structure (32 sections, 12 theorems, 70 equations, 14 figures, 2 algorithms)

Key Result

Theorem 4

Let $\mathcal{X}^-, \mathcal{X}^+ \subset R\mathbb{B}_2^d$ be finite and disjoint. Suppose that there is a mutual covering with $\delta$-separated centers and radii satisfying condition \ref{eq:condition-radius}. Then, with high probability, Algorithm \ref{alg:pruning} terminates in polynomial time and outputs a $2$-hidden-layer neural network, with [...] neurons and [...] parameters, that interpolates $\mathcal{X}^-$ and $\mathcal{X}^+$.

Figures (14)

  • Figure 1: The mutual covering is 'problem-adaptive'. Condition \ref{eq:condition-radius} on the radii in Theorem \ref{thm:main-result-informal} allows a covering 'adapted to' the mutual arrangement of the data: only the parts of the data that lie close to the ideal decision boundary need to be covered using balls with small diameters; other parts can be crudely covered using larger balls.
  • Figure 2: Random hyperplanes in the input domain $\mathbb{R}^d$. In Algorithm \ref{alg:pruning} we iteratively sample random hyperplanes $H[\bm{w}_i, b_i]$ until every pair of points with opposite labels is separated by at least one of them. This tessellates the space into multiple cells, where each cell is only populated with points of the same label. Each hyperplane can be associated with one of the neurons of the first layer $\Phi$.
  • Figure 3: The effect of the first layer $\Phi$. After transforming the data with the first layer $\Phi$ we can, for each $\bm{x}^- \in \mathcal{X}^-$, construct a hyperplane $H[-\bm{u}_{\bm{x}^-}, m_{\bm{x}^-}]$ that separates $\Phi(\mathcal{X}^+)$ from $\Phi(\bm{x}^-)$. Each hyperplane can be associated with one of the neurons in the second layer.
  • Figure 4: Motivation for forward selection. While each $\Phi(\bm{x}^-)$ is separated by a corresponding 'dedicated' hyperplane from $\Phi(\mathcal{X}^+)$ (depicted in dashed grey), we can identify a single hyperplane $H[-\bm{u}_{\bm{x}_*^-}, m_{\bm{x}_*^-}]$ (depicted in grey) that separates several $\Phi(\bm{x}^-)$ from $\Phi(\mathcal{X}^+)$ simultaneously. The other hyperplanes are redundant and the corresponding neurons do not need to be included in the second layer $\hat{\Phi}$.
  • Figure 5: Binary classification on the Two Moons data set.
  • ...and 9 more figures
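
The construction described in the captions of Figures 2 to 4 can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's algorithm as published. Gaussian hyperplane directions, biases drawn uniformly from $[-\lambda, \lambda]$, threshold activations, an oriented notion of separation ($\bm{x}^-$ inactive, $\bm{x}^+$ active), and a greedy pass that simply takes the first uncovered $\bm{x}^-$ are all choices made here for illustration. The function names and the parameters `lam` and `max_iter` are hypothetical.

```python
import numpy as np


def threshold(t):
    """Threshold activation: 1 where t > 0, else 0."""
    return (t > 0).astype(float)


def sample_first_layer(X_neg, X_pos, lam, rng, max_iter=10_000):
    """Sample random hyperplanes until every opposite-label pair is separated.

    A pair (x^-, x^+) counts as separated here if some hyperplane has
    x^- on its inactive side and x^+ on its active side (an assumption).
    """
    d = X_neg.shape[1]
    unseparated = np.ones((len(X_neg), len(X_pos)), dtype=bool)
    W, b = [], []
    while unseparated.any():
        if len(W) >= max_iter:
            raise RuntimeError("pairs still unseparated after max_iter draws")
        w = rng.standard_normal(d)       # assumed: Gaussian direction
        bias = rng.uniform(-lam, lam)    # assumed: uniform bias
        neg_off = X_neg @ w + bias <= 0
        pos_on = X_pos @ w + bias > 0
        unseparated &= ~np.outer(neg_off, pos_on)
        W.append(w)
        b.append(bias)
    return np.array(W), np.array(b)


def first_layer(X, W, b):
    """Phi(x): one threshold neuron per sampled hyperplane."""
    return threshold(X @ W.T + b)


def build_second_layer(Z_neg, Z_pos):
    """Greedy selection of second-layer neurons.

    Each selected neuron is built from one Phi(x^-) and also covers every
    other Phi(x^-) lying on the same side of its hyperplane, so those
    points need no dedicated neuron.
    """
    U, m = [], []
    remaining = np.ones(len(Z_neg), dtype=bool)
    while remaining.any():
        i = np.flatnonzero(remaining)[0]   # assumed: take first uncovered point
        u = 1.0 - Z_neg[i]                 # first-layer neurons inactive at x^-
        margin = (Z_pos @ u).min()         # >= 1 once all pairs are separated
        U.append(u)
        m.append(margin)
        remaining &= ~(Z_neg @ u < margin)  # drop every x^- this neuron covers
    return np.array(U), np.array(m)


def predict(X, W, b, U, m):
    """Label -1 if any second-layer neuron fires, else +1."""
    Z = first_layer(X, W, b)
    hidden = threshold(m - Z @ U.T)
    return np.where(hidden.sum(axis=1) > 0, -1, 1)
```

Because each selected neuron is inactive on all of $\Phi(\mathcal{X}^+)$ and active on at least the point it was built from, the resulting network interpolates the two classes whenever the sampling loop finishes. The number of second-layer neurons is at most $|\mathcal{X}^-|$ and is smaller whenever one hyperplane covers several points, which is the redundancy Figure 4 illustrates.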

Theorems & Definitions (19)

  • Definition 1: Interpolation
  • Definition 2: $\delta$-separation
  • Definition 3: Mutual covering
  • Theorem 4: Informal
  • Corollary 5
  • Remark 6
  • Proposition 7: Termination and correctness
  • Proposition 8: Run time
  • Remark 9
  • Proposition 10: Limit shape of activation regions---threshold activations
  • ...and 9 more