Complex fractal trainability boundary can arise from trivial non-convexity

Yizhou Liu

TL;DR

Fractal trainability boundaries can emerge from simple non-convex perturbations, namely adding cosine-type perturbations to quadratic functions or multiplying quadratic functions by them. Fractal trainability boundaries can therefore arise from very simple non-convexity.

Abstract

Training neural networks involves optimizing parameters to minimize a loss function, where the nature of the loss function and the optimization strategy are crucial for effective training. Hyperparameter choices, such as the learning rate in gradient descent (GD), significantly affect the success and speed of convergence. Recent studies indicate that the boundary between hyperparameters that give bounded training and those that give divergent training can be fractal, complicating reliable hyperparameter selection. However, the nature of this fractal boundary and methods to avoid it remain unclear. In this study, we focus on GD to investigate the loss landscape properties that might lead to fractal trainability boundaries. We find that fractal boundaries can emerge from simple non-convex perturbations, i.e., adding cosine-type perturbations to quadratic functions or multiplying quadratic functions by them. The observed fractal dimensions are influenced by factors such as parameter dimension, type of non-convexity, perturbation wavelength, and perturbation amplitude. Our analysis identifies the "roughness of perturbation", which measures the gradient's sensitivity to parameter changes, as the factor controlling the fractal dimensions of trainability boundaries. We observe a clear transition from non-fractal to fractal trainability boundaries as roughness increases, with the critical roughness being the value at which the perturbed loss function becomes non-convex. Thus, we conclude that fractal trainability boundaries can arise from very simple non-convexity. We anticipate that our findings will enhance the understanding of complex behaviors during neural network training, leading to more consistent and predictable training strategies.
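
The setup in the abstract can be illustrated with a minimal sketch. It assumes the one-dimensional forms $f_+(x) = x^2 + \epsilon\cos(2\pi x/\lambda)$ and $f_\times(x) = x^2\,(1 + \epsilon\cos(2\pi x/\lambda))$ for the additive and multiplicative perturbations; the abstract does not state these forms, so they are assumptions. The function names, the initial point `x0`, the step count, and the `blowup` threshold used to declare divergence are likewise hypothetical choices rather than the paper's settings.

```python
import math


def grad_additive(x, eps, lam):
    """Gradient of the ASSUMED loss f_+(x) = x^2 + eps*cos(2*pi*x/lam)."""
    k = 2.0 * math.pi / lam
    return 2.0 * x - eps * k * math.sin(k * x)


def grad_multiplicative(x, eps, lam):
    """Gradient of the ASSUMED loss f_x(x) = x^2 * (1 + eps*cos(2*pi*x/lam))."""
    k = 2.0 * math.pi / lam
    return 2.0 * x * (1.0 + eps * math.cos(k * x)) - eps * k * x * x * math.sin(k * x)


def is_bounded(grad, lr, x0=1.0, steps=1000, blowup=1e6):
    """Run plain GD and report whether the iterate stays bounded.

    x0, steps and blowup are illustrative assumptions, not values from the paper.
    """
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)
        if not math.isfinite(x) or abs(x) > blowup:
            return False  # treated as divergent training
    return True  # treated as bounded training


if __name__ == "__main__":
    # Unperturbed quadratic (eps = 0): the update is x <- (1 - 2*lr) * x,
    # so training is bounded exactly when lr <= 1.
    print(is_bounded(lambda x: grad_additive(x, 0.0, 0.1), lr=0.5))
    print(is_bounded(lambda x: grad_additive(x, 0.0, 0.1), lr=1.5))
```

Classifying each learning rate as bounded or divergent in this way gives the labels from which a trainability boundary can then be traced.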

Paper Structure (6 sections, 18 equations, 9 figures)


Figures (9)

  • Figure 1: We conduct numerical experiments on constructed loss functions to study trainability boundaries. (a) Illustration of a loss landscape with additive perturbation ($f_+$ with $\epsilon = 0.2$ and $\lambda = 0.1$). (b) An example of a loss constructed with multiplicative perturbation ($f_\times$ with $\epsilon = 0.2$ and $\lambda = 0.1$). (c) We divide a fixed range of learning rates into $N$ small segments and evaluate whether training diverges at each end of each segment. This yields a set of boundary segments, $B_N$, whose size we count. (d) An example showing that with more (finer) segments, the number of boundary segments (black segments) increases (figure obtained for the multiplicative perturbation case $f_\times$ with $\epsilon = 0.2$ and $\lambda = 0.1$). The colored bar at the bottom visualizes losses for bounded (blue) and divergent (red) training.
  • Figure 2: Simple non-convexity can lead to fractal trainability boundaries, whose fractal dimensions depend on perturbation form, wavelength, and amplitude. (a) For the additive perturbation case $f_+$ with $\epsilon = 0.2$ and $\lambda = 0.1$, we studied learning rates in $[0,1.5]$, where the number of boundary segments $|B_N|$ grows as a power law in the number of segments $N$, suggesting a fractal trainability boundary. The fractal dimension, i.e., the slope of $\log|B_N|$ against $\log N$, is fitted as $0.996 \pm 0.005$. (b) For the multiplicative perturbation case $f_\times$ with $\epsilon = 0.2$ and $\lambda = 0.1$, a fractal trainability boundary is also observed, with fractal dimension $0.837 \pm 0.004$. (c, d) The fractal dimension of the trainability boundary varies with perturbation amplitude $\epsilon$ and wavelength $\lambda$. In particular, for the additive perturbation case (c), the fractal dimension increases with larger amplitude and smaller wavelength. For the multiplicative case (d), the fractal dimension has no clear dependence on the two function parameters.
  • Figure 3: Roughness determines the fractal dimension of trainability boundaries and captures the transition to a fractal trainability boundary when the landscape becomes non-convex. (a) For the additive perturbation case, the roughness $\theta_+$ organizes well the fractal dimensions $\alpha$ obtained at different amplitudes and wavelengths (data from Fig. 2c), with a clear transition to non-zero fractal dimensions near $\theta_+ = 1/(2\pi^2)$ (dashed line, corresponding to the emergence of non-convexity). (b) For the multiplicative perturbation case, the roughness $\theta_\times$ determines the fractal dimensions $\alpha$ (data from Fig. 2d). Error bars are standard deviations of fitting.
  • Figure 4: Beyond the simple cases we can analyze, the fractal dimension of the trainability boundary depends on many other parameters that determine the loss function, while it appears to hold generally that non-convexity leads to fractal behavior. (a) For the additive case with two cosine perturbations, the fractal dimension depends on the amplitudes in a complicated way, but fractal behavior still appears only once the loss is non-convex (the red line is the boundary of convexity). (b, c) For high-dimensional optimization, the fractal dimension can depend on the parameter dimension. (b) The fractal dimension is robust to increasing the parameter dimension $d$ for the additive perturbation case. (c) The fractal dimension increases with the parameter dimension $d$ for the multiplicative perturbation case. Error bars are standard deviations of fitting.
  • Figure S1: Training loss can be bounded or divergent. (a) An example of bounded training for the multiplicative perturbation case with amplitude $\epsilon = 0.2$, wavelength $\lambda=0.1$, and learning rate $s = 0.01$, where the loss decays and stays bounded. (b) An example of divergent training for the multiplicative perturbation case with amplitude $\epsilon = 0.2$, wavelength $\lambda=0.1$, and learning rate $s = 0.2$.
  • ...and 4 more figures
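
The counting procedure described in the captions of Figures 1c and 2a can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: a segment is counted as a boundary segment when training is bounded at one of its ends and divergent at the other, and the fractal dimension is estimated as the ordinary least-squares slope of $\log|B_N|$ against $\log N$. The names `count_boundary_segments`, `fit_slope`, and `bounded_at` are hypothetical.

```python
import math


def count_boundary_segments(bounded_at, lo, hi, n):
    """Split [lo, hi] into n equal segments and count boundary segments.

    bounded_at(lr) -> bool is any classifier of a learning rate as bounded
    (True) or divergent (False). A segment counts as a boundary segment when
    its two endpoints get different labels (an assumed reading of Fig. 1c).
    """
    labels = [bounded_at(lo + (hi - lo) * i / n) for i in range(n + 1)]
    return sum(1 for a, b in zip(labels, labels[1:]) if a != b)


def fit_slope(ns, counts):
    """Least-squares slope of log(count) against log(n).

    Assumes every count is positive, since log(0) is undefined.
    """
    xs = [math.log(n) for n in ns]
    ys = [math.log(c) for c in counts]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den


if __name__ == "__main__":
    # Toy classifier with a single sharp threshold at lr = 1: the boundary is
    # one point, so the count stays at 1 and the fitted slope is 0.
    ns = [10, 100, 1000]
    counts = [count_boundary_segments(lambda s: s < 1.0, 0.0, 1.5, n) for n in ns]
    print(counts, fit_slope(ns, counts))
```

A single-threshold boundary gives a count that does not grow with $N$, whereas a count that grows as a power of $N$ gives a positive slope, which is the signature the captions read as a fractal trainability boundary.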