Riddled basin geometry sets fundamental limits to predictability and reproducibility in deep learning

Andrew Ly, Pulin Gong

TL;DR

This work derives sufficient conditions for the emergence of riddled basins by analytically linking features widely observed in deep learning, including chaotic learning dynamics and symmetry-induced invariant subspaces, to reveal a general route to riddling in realistic deep networks.

Abstract

Fundamental limits to predictability are central to our understanding of many physical and computational systems. Here we show that, despite its remarkable capabilities, deep learning exhibits such fundamental limits rooted in the fractal, riddled geometry of its basins of attraction: any initialization that leads to one solution lies arbitrarily close to another that leads to a different one. We derive sufficient conditions for the emergence of riddled basins by analytically linking features widely observed in deep learning, including chaotic learning dynamics and symmetry-induced invariant subspaces, to reveal a general route to riddling in realistic deep networks. The resulting basins of attraction possess an infinitely fine-scale fractal structure characterized by an uncertainty exponent near zero, so that even large increases in the precision of initial conditions yield only marginal gains in outcome predictability. Riddling thus imposes a fundamental limit on the predictability and hence reproducibility of neural network training, providing a unified account of many empirical observations. These results reveal a general organizing principle of deep learning with important implications for optimization and the safe deployment of artificial intelligence.
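
As a rough worked illustration of what a near-zero uncertainty exponent means, assume the power-law scaling $f(\varepsilon) \sim \varepsilon^{\phi}$ defined in the Figure 1 caption holds over the whole range considered, and take the minimal-model value $\phi = 0.0126$ reported in the Figure 3 caption. The factor of $10^{10}$ is an arbitrary choice made only for this example. Improving the precision of the initialization by ten orders of magnitude then changes the uncertain fraction by

$$
\frac{f(\varepsilon / 10^{10})}{f(\varepsilon)} \approx \left(10^{-10}\right)^{\phi} = 10^{-0.126} \approx 0.75 ,
$$

that is, a reduction of only about 25%. For comparison, a smooth basin boundary has $\phi = 1$, for which the same improvement in precision would shrink the uncertain fraction by a factor of $10^{-10}$.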

Paper Structure

This paper contains 18 sections, 11 equations, 10 figures.

Figures (10)

  • Figure 1: Schematic of riddled basins and outcome unpredictability. a, The basin of attractor $A$ (blue) is riddled with that of attractor $B$ (orange). Arrows indicate successive magnifications centered on the white cross; the final panel zooms in on the boxed region, revealing interleaved fractal structure at arbitrarily fine scales. b, The fractal structure is quantified by $f(\varepsilon)$, the probability that a random perturbation of magnitude $\varepsilon$ changes the attractor. Error bars represent 95% confidence intervals. A near-zero uncertainty exponent $\phi$, defined by $f(\varepsilon) \sim \varepsilon^\phi$, indicates that increasing the precision of the initialization yields only marginal gains in predictability; the qualitative fate of a given initialization ($A$ or $B$?) remains effectively unpredictable, thus undermining reproducibility.
  • Figure 2: Chaotic attractor in the training of the minimal model. a, A chaotic attractor within the permutation-invariant plane $\mathcal{P}_{+}$ is traced by the training trajectory from a random initialization $\bm\uptheta_0 \in \mathcal{P}_{+}$ (see "Methods" for details). Each point represents the coordinates of an iterate with respect to the basis of $\mathcal{P}_{+}$, comprising $\mathbf{e}_1 = (1,1,0,0)/\sqrt{2}$ and $\mathbf{e}_2 = (0,0,1,1)/\sqrt{2}$; color encodes epoch. b, Distributions of finite time-$T$ transverse Lyapunov exponents, $\lambda_3^T$ for $T = 32, 128, 512$, show non-zero fractions of positive values: $25.3\%$, $9.4\%$ and $0.8\%$, respectively. c, The inverse mean squared fluctuations of finite-time exponents, $\langle (\lambda_3^T - \lambda_3)^2\rangle^{-1}$, grows linearly with $T$ for large $T$. Error bars, which denote 95% confidence intervals, are smaller than the points.
  • Figure 3: Riddling in the minimal model. a, Destination map for a $2047 \times 2047$ uniform grid of initializations on the plane spanned by two random directions $\mathbf{e}_{\parallel}$ and $\mathbf{e}_{\perp}$, which are longitudinal and transverse to the $\mathcal{P}_{+}$ permutation-invariant subspace, respectively. Initializations converging to $\mathcal{P}_{+}$, $\mathcal{P}_{-}$ and infinity are colored blue, orange and white, respectively. The resulting basins of attraction exhibit a striking butterfly-like pattern. b, Magnification of the right inset in (a) on a $1024 \times 1024$ grid. c, Magnification of the inset in (b) on a $1024 \times 1024$ grid. d, Magnification of the left inset in (a) on a $1024 \times 1024$ grid; for visibility of fine-scale structure, blue and orange are replaced with white and black, respectively. e, The uncertainty fraction $f(\varepsilon)$ exhibits small-$\varepsilon$ scaling $f(\varepsilon) \sim \varepsilon^{\phi}$ with uncertainty exponent $\phi = 0.0126\pm 0.0002$. Error bars denote 95% confidence intervals.
  • Figure 4: Riddling in deep neural network training. a, A randomly initialized VGG-12 network is attracted to a parity-invariant subspace during training. The vectorized network weights $\bm{\uptheta}$ are projected onto three random dimensions: $\mathbf{e}_{\parallel,1}$, $\mathbf{e}_{\parallel,2}$ (longitudinal to the invariant subspace) and $\mathbf{e}_{\perp}$ (transverse). The shadow on each pane is a two-dimensional histogram, where darker shades indicate higher frequency. b, Training destination map for a $255 \times 255$ uniform grid of initializations on the plane spanned by $\mathbf{e}_{\parallel} = \mathbf{e}_{\parallel,1}$ and $\mathbf{e}_{\perp}$. Each color denotes a unique parity-invariant subspace; in total, there are 1772 different destinations. White points approach the origin. c, Same as (b), except initializations converging to the invariant subspace at $\bm{\uptheta} \cdot \mathbf{e}_\perp = 0$ are colored black. All other destinations are white. d, Magnification around the red dot in (c) on a $128 \times 128$ grid. e, The uncertainty fraction $f(\varepsilon)$ for initializations within an $\varepsilon$-hypercube centered at the middle of (d), $(\bm{\uptheta}\cdot \mathbf{e}_{\parallel}, \bm{\uptheta}\cdot \mathbf{e}_{\perp}) = (16.0005, 0.0355)$. Networks are trained on CPU to ensure determinism. Dots and error bars show the mean and $95\%$ confidence intervals from bootstrap resampling. A power-law fit $f(\varepsilon) \propto \varepsilon^\phi$ (dashed line) yields the uncertainty exponent $\phi = 0.000 \pm 0.002$. f, Same as (e), except with GPU training (faster but non-deterministic). The uncertainty exponent is also $\phi = 0.000 \pm 0.002$.
  • Figure 5: Training of the minimal model converges to permutation-invariant planes. A $64 \times 64$ uniform grid of initializations on the plane spanned by random orthonormal vectors, $\mathbf{e}_1$ and $\mathbf{e}_2$, is trained for $10^3$ epochs. a, Training with a learning rate of $\eta = 1$. Color encodes the nearest invariant subspace: $\mathcal{P}_{+}$ is blue and $\mathcal{P}_{-}$ is red. Color intensity represents the distance to this subspace, $d_{\pm}(\bm\uptheta) = \Vert \mathbf{w}_1 \mp \mathbf{w}_2 \Vert^2$. b, Same as (a), with $\eta = 1.5$. c, Same as (a), with $\eta = 2$. d, Evolution of the weights for a representative initialization from the $\eta=1.5$ grid that converges to $\mathcal{P}_{+}$. e, Same as (d), for an initialization that converges to $\mathcal{P}_{-}$. f, Same as (d), for an initialization that does not converge to either $\mathcal{P}_{+}$ or $\mathcal{P}_{-}$.
  • ...and 5 more figures
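
The uncertainty fraction $f(\varepsilon)$ and exponent $\phi$ used in the captions above can be estimated numerically along the following lines. This is a minimal sketch under stated assumptions, not the authors' procedure: the function names, the sampling box $[-1, 1]^2$, the sample count, and the toy `smooth_destination` system (two outcomes separated by a circle) are all hypothetical choices made for illustration. Because the toy boundary is smooth, the fitted exponent should come out close to 1, in contrast to the near-zero values reported for riddled basins.

```python
import numpy as np


def uncertainty_fraction(destination, eps, n_samples, rng, dim=2):
    """Estimate f(eps): the fraction of random points whose destination
    changes under a perturbation of magnitude eps in a random direction."""
    # Hypothetical sampling region: the box [-1, 1]^dim.
    x = rng.uniform(-1.0, 1.0, size=(n_samples, dim))
    direction = rng.normal(size=(n_samples, dim))
    direction /= np.linalg.norm(direction, axis=1, keepdims=True)
    changed = destination(x) != destination(x + eps * direction)
    return changed.mean()


def fit_uncertainty_exponent(eps_values, fractions):
    """Fit f(eps) ~ eps**phi by least squares on log-log axes; return phi."""
    slope, _intercept = np.polyfit(np.log(eps_values), np.log(fractions), 1)
    return slope


def smooth_destination(x):
    """Toy stand-in for 'which attractor does training reach?'.
    Outcome 1 inside the circle x^2 + y^2 < 0.5, outcome 0 outside."""
    return (x[:, 0] ** 2 + x[:, 1] ** 2 < 0.5).astype(int)


rng = np.random.default_rng(0)
eps_values = np.logspace(-3, -1, 5)
fractions = np.array(
    [uncertainty_fraction(smooth_destination, eps, 400_000, rng) for eps in eps_values]
)
phi = fit_uncertainty_exponent(eps_values, fractions)
print(f"estimated uncertainty exponent: {phi:.3f}")
```

To apply the same estimator to a training problem, `destination` would be replaced by a function that trains from each initialization and returns a label for the outcome reached. That is far more expensive than the toy function used here, and the log-log fit requires every estimated fraction to be non-zero.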