
The boundary of neural network trainability is fractal

Jascha Sohl-Dickstein

TL;DR

The paper investigates whether the boundary between trainable and untrainable neural network hyperparameters exhibits fractal structure, by treating neural network training as an iterated map $f(W; \eta) = W - \eta\,g(W)$, where $W$ denotes the network parameters, $\eta$ the learning rate, and $g(W)$ the gradient of the loss. The author runs systematic full-batch and minibatch experiments on a one-hidden-layer network with mean-field parameterization, grid-searching over pairs of hyperparameters and visualizing the resulting trainability landscapes; fractal dimensions are estimated via box-counting. Across six experimental conditions (tanh, ReLU, identity, minibatch, single datapoint, and a grid search over initialization hyperparameters), the boundary consistently displays fractal behavior, with estimated fractal dimensions ranging roughly from $1.17$ to $1.98$. These findings suggest that meta-loss landscapes and hyperparameter sensitivity in neural network training may inherit fractal properties, offering a new lens for meta-learning and hyperparameter optimization near the edge of stability.
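The box-counting estimate mentioned above can be illustrated with a minimal sketch. It assumes the grid search has already produced a square boolean array of diverged/converged outcomes; the helper names `boundary_mask` and `box_counting_dimension`, the choice of box sizes, and the neighbour-difference definition of the boundary are illustrative assumptions, not the paper's actual code.

```python
import numpy as np

def boundary_mask(diverged):
    """Mark cells whose right or lower neighbour has a different outcome."""
    d = np.asarray(diverged, dtype=bool)
    edge = np.zeros_like(d)
    edge[:, :-1] |= d[:, :-1] != d[:, 1:]   # horizontal label changes
    edge[:-1, :] |= d[:-1, :] != d[1:, :]   # vertical label changes
    return edge

def box_counting_dimension(edge, box_sizes=(1, 2, 4, 8, 16)):
    """Estimate the box-counting dimension of the True cells in `edge`.

    Assumes `edge` is square, its side is divisible by every box size,
    and it contains at least one boundary cell.
    """
    n = edge.shape[0]
    counts = []
    for s in box_sizes:
        # Tile the grid into s-by-s boxes and count boxes touching the boundary.
        blocks = edge.reshape(n // s, s, n // s, s)
        counts.append(np.count_nonzero(blocks.any(axis=(1, 3))))
    # The dimension is the slope of log(count) against log(1 / box size).
    slope, _ = np.polyfit(np.log(1.0 / np.array(box_sizes)), np.log(counts), 1)
    return slope

# Sanity check on a non-fractal boundary: a straight vertical line.
grid = np.zeros((256, 256), dtype=bool)
grid[:, 128:] = True
print(box_counting_dimension(boundary_mask(grid)))  # close to 1 for a straight line
```

A straight boundary gives a slope of about 1 and a boundary that fills the plane gives about 2, so the reported range of roughly $1.17$ to $1.98$ sits between these two extremes.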

Abstract

Some fractals -- for instance those associated with the Mandelbrot and quadratic Julia sets -- are computed by iterating a function, and identifying the boundary between hyperparameters for which the resulting series diverges or remains bounded. Neural network training similarly involves iterating an update function (e.g. repeated steps of gradient descent), can result in convergent or divergent behavior, and can be extremely sensitive to small changes in hyperparameters. Motivated by these similarities, we experimentally examine the boundary between neural network hyperparameters that lead to stable and divergent training. We find that this boundary is fractal over more than ten decades of scale in all tested configurations.
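The kind of experiment the abstract describes can be sketched in a few lines: iterate a gradient-descent update, flag runs whose loss blows up, and repeat over a grid of hyperparameters. Everything specific here is an assumption made for illustration rather than the paper's setup: the two hyperparameters are taken to be per-layer learning rates `eta0` and `eta1`, and the network width, random data, scaling factors, step count, and blow-up threshold are arbitrary choices.

```python
import numpy as np

def train_diverges(eta0, eta1, n_steps=200, width=8, n_data=8, seed=0, blowup=1e6):
    """Return True if full-batch gradient descent blows up for learning rates (eta0, eta1)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((width, n_data))   # inputs, one column per datapoint
    y = rng.standard_normal((1, n_data))       # scalar targets
    W0 = rng.standard_normal((width, width))   # first-layer weights
    W1 = rng.standard_normal((1, width))       # second-layer weights
    for _ in range(n_steps):
        h = np.tanh(W0 @ X / np.sqrt(width))   # hidden activations
        err = W1 @ h / width - y               # 1/width output scaling (assumed)
        loss = 0.5 * np.mean(err ** 2)
        if not np.isfinite(loss) or loss > blowup:
            return True                        # treat as divergent training
        # Backpropagate the mean squared error through both layers.
        g_out = err / n_data
        gW1 = g_out @ h.T / width
        g_pre = (W1.T @ g_out / width) * (1.0 - h ** 2)
        gW0 = g_pre @ X.T / np.sqrt(width)
        # One step of the iterated map W <- W - eta * g(W), per layer.
        W0 -= eta0 * gW0
        W1 -= eta1 * gW1
    return False

def divergence_grid(etas0, etas1, **kwargs):
    """Boolean image: rows index eta1, columns index eta0, True where training diverged."""
    return np.array([[train_diverges(e0, e1, **kwargs) for e0 in etas0] for e1 in etas1])

etas = np.logspace(-3, 5, 9)
print(divergence_grid(etas, etas).astype(int))
```

Zooming in means re-running `divergence_grid` on an ever narrower range of learning rates centred on a point where neighbouring cells disagree; in the paper, float64 precision eventually limits how far this can go.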

Paper Structure (13 sections, 1 equation, 1 figure)


Figures (1)

  • Figure 1: The boundary between trainable and untrainable neural network hyperparameters is fractal, for all experimental conditions. Images show a 2D grid search over neural network hyperparameters. For points shaded red, training diverged. For points shaded blue, training converged. Paler points correspond to faster convergence or divergence. Experimental conditions include different network nonlinearities, both minibatch and full batch training, and grid searching over either training or initialization hyperparameters. See the section on experimental conditions for details. Each image is a hyperlink to an animation zooming into the corresponding fractal landscape (to the depth at which float64 discretization artifacts appear). Experimental code, images, and videos are available at https://github.com/Sohl-Dickstein/fractal.
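
The colour scheme in the caption (red for divergence, blue for convergence, paler for faster) can be mimicked with a small sketch. The function name `shade`, the `speed` input normalised to [0, 1], and the `max_pale` blend limit are hypothetical; the caption does not say how speed is measured or mapped to colour, so this is one plausible rendering rather than the one used for the figure.

```python
import numpy as np

def shade(diverged, speed, max_pale=0.8):
    """Map outcomes to an RGB image with float values in [0, 1].

    diverged: boolean array, True where training diverged.
    speed: float array in [0, 1], where 1 means fastest convergence or
           divergence (how speed is measured is left to the caller).
    """
    diverged = np.asarray(diverged, dtype=bool)
    speed = np.clip(np.asarray(speed, dtype=float), 0.0, 1.0)
    red, blue, white = np.array([1.0, 0.0, 0.0]), np.array([0.0, 0.0, 1.0]), np.ones(3)
    base = np.where(diverged[..., None], red, blue)   # red = diverged, blue = converged
    # Blend towards white as speed increases, so faster runs look paler.
    return base + (white - base) * (max_pale * speed[..., None])

img = shade([[True, False]], [[0.0, 1.0]])
print(img.shape)  # (1, 2, 3)
```

The resulting array can be passed directly to an image viewer such as Matplotlib's `imshow`, which accepts float RGB values in [0, 1].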