Scaling of learning time for high dimensional inputs

Carlos Stein Brito

TL;DR

It is shown that the learning dynamics reduce to a unidimensional problem, with learning times dependent only on initial conditions, and a new framework for analyzing learning dynamics and model complexity in neural network models is outlined.

Abstract

Representation learning from complex data typically involves models with a large number of parameters, which in turn require large numbers of data samples. In neural network models, model complexity grows with the number of inputs to each neuron, with a trade-off between model expressivity and learning time. A precise characterization of this trade-off would help explain the connectivity and learning times observed in artificial and biological networks. We present a theoretical analysis of how learning time depends on input dimensionality for a Hebbian learning model performing independent component analysis. Based on the geometry of high-dimensional spaces, we show that the learning dynamics reduce to a unidimensional problem, with learning times dependent only on initial conditions. For higher input dimensions, initial parameters have smaller learning gradients and longer learning times. We find that learning times scale supralinearly with input dimension, quickly becoming prohibitive for high-dimensional inputs. These results reveal a fundamental limitation for learning in high dimensions and help elucidate how the optimal design of neural networks depends on data complexity. Our approach outlines a new framework for analyzing learning dynamics and model complexity in neural network models.
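
As a minimal illustration of why a one-dimensional reduction makes learning time depend on initial conditions, suppose the overlap $d$ with a hidden feature follows a deterministic flow with the cubic small-overlap gradient reported in the caption of Figure 4(d). The rate constant $c$, the learning rate $\eta$, the initial overlap $d_0$, and the target overlap $d_*$ are symbols introduced here for illustration, and gradient noise is ignored, so this is a sketch under stated assumptions rather than the paper's derivation.

```latex
\begin{align}
\frac{\mathrm{d}d}{\mathrm{d}t} &= \eta\, c\, d^{3}, \qquad d(0) = d_0 \\
\int_{d_0}^{d_*} \frac{\mathrm{d}d}{d^{3}} &= \eta\, c\, T
\;\;\Longrightarrow\;\;
T = \frac{1}{2\eta c}\left(\frac{1}{d_0^{2}} - \frac{1}{d_*^{2}}\right) \\
T &\approx \frac{1}{2\eta c\, d_0^{2}} \quad \text{for } d_0 \ll d_* .
\end{align}
```

Under these assumptions the time to reach $d_*$ is dominated by the initial overlap $d_0$. The cubic law is only reported for small overlaps, so the closed form is indicative at best for larger $d$. How $d_0$ and the effective rate depend on $N$ is what the paper's analysis supplies; this sketch does not attempt to reproduce the scaling shown in Figure 5.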

Paper Structure (18 sections, 21 equations, 5 figures)

Introduction
Results
Nonlinear Hebbian learning of sparse features
The geometry of the optimization surface for input weights
Quasi-orthogonal random directions in high-dimensional spaces
Reduction to unidimensional dynamics
Learning time dependence on input dimensionality
Discussion
Extensions to alternative neural network models
Theory for convolutional neural networks
Implications for neural network learning dynamics
Theory for localized receptive fields and number of synapses in the cortex
Methods
Number of maxima and saddle points
Data generation
...and 3 more sections

Figures (5)

Figure 1: Geometry of the optimization surface for synaptic weights. (a) Prototypical minimum in a convex surface with a basin of attraction. (left) Gray scale indicates optimization function value, with minimum in white. (right) Color heat map indicates the magnitude of the gradient, with zero gradients in dark blue. (b) Prototypical maximum in a concave surface, with an unstable equilibrium point. (c) Saddle points are concave in some parameter directions and convex in others. (d) Surface of the gradient magnitude for three weights. Each dimension represents the value of a synaptic weight. As we enforce $\|\mathbf{w}\|_2 = 1$, the parameter space is constrained to the unit sphere. The cardinal directions are the minima, the directions of the hidden patterns. The symmetric directions, where all weights have the same magnitude, are maxima. Partially symmetric directions, where two weights have the same magnitude and the third is zero, are saddle points. Areas in blue indicate low gradient magnitude, and thus slow learning dynamics.
Figure 2: Initial parameters have small overlap with hidden features in high dimensions. (a) The expected overlap of initial parameters with a hidden feature decays as $1/\sqrt{N}$ (fixed $K = 10$). Dashed line is a power law with exponent $\alpha = -0.5$. (b) Increasing the number of hidden features has only a logarithmic effect on the expected initial overlap (fixed $N = 1000$).
Figure 3: Stereotypical learning dynamics in high-dimensional optimization. (a) Evolution of the overlap of the weights with a hidden feature, for different dimensionality $N$. Color indicates $N$, varying from $N = 10$ (purple) to $N = 160$ (red). One run for each $N$, with random initial weights. (b) Same trajectories shifted to a reference time at which an overlap of $d = 0.75$ was reached, highlighting the similarity in the learning dynamics.
Figure 4: Gradient dependency on the overlap $d$ for symmetric sparse distributions. (a) The gradient magnitude $\mu(d)$ vanishes when the overlap goes to zero. (b) The gradient variability $\sigma(d)$ does not change significantly for small overlaps. (c) The gradient signal-to-noise ratio $\mu(d)^2/\sigma(d)^2$ follows the profile of the gradient magnitude. (d) For small overlaps, $d \to 0$, the gradient has a power-law dependency on the overlap (dashed line is a power law with exponent $\alpha = 3$).
Figure 5: Learning time dependence on the number of inputs. Learning time for input dimension $N$ and $K = N$ latent features (in blue) matches the scaling predicted by the theory, $T \propto \frac{N^{2}}{\log(N)}$. Learning time was defined as the time at which the overlap passes $d = 0.7$, averaged over 100 simulations.
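
The initial-overlap scaling described in the caption of Figure 2 can be checked numerically with a short sketch. It assumes, for illustration only, that the hidden features are orthonormal and aligned with the first $K$ coordinate axes, that initial weights are drawn uniformly on the unit sphere, and that "overlap" means the largest absolute projection onto any of the $K$ features. The function name `initial_overlap` and all parameter values are hypothetical, not taken from the paper.

```python
import numpy as np

def initial_overlap(n_inputs, n_features, n_trials=2000, seed=0):
    """Mean overlap of a random unit weight vector with its nearest hidden feature.

    Assumption: the hidden features are the first `n_features` coordinate axes.
    """
    rng = np.random.default_rng(seed)
    # Normalized Gaussian vectors are uniformly distributed on the unit sphere.
    w = rng.standard_normal((n_trials, n_inputs))
    w /= np.linalg.norm(w, axis=1, keepdims=True)
    # Overlap with axis k is |w_k|; keep the best-aligned feature in each trial.
    return np.abs(w[:, :n_features]).max(axis=1).mean()

# (a) Fixed K = 10, varying N: fit the exponent of the decay on a log-log scale.
dims = np.array([100, 200, 400, 800, 1600])
overlaps_n = np.array([initial_overlap(n, 10) for n in dims])
slope = np.polyfit(np.log(dims), np.log(overlaps_n), 1)[0]
print(f"fitted exponent in N: {slope:.3f}")

# (b) Fixed N = 1000, varying K: the overlap grows only slowly with K.
feature_counts = [1, 10, 100, 1000]
overlaps_k = np.array([initial_overlap(1000, k) for k in feature_counts])
for k, d0 in zip(feature_counts, overlaps_k):
    print(f"K = {k:4d}: mean initial overlap = {d0:.4f}")
```

Under these assumptions the fitted exponent should lie close to $-0.5$, and a hundredfold increase in $K$ should raise the overlap by much less than a factor of ten. The sketch says nothing about the learning dynamics themselves.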
