Do Deep Neural Network Solutions Form a Star Domain?

Ankit Sonthalia, Alexander Rubinstein, Ehsan Abbasnejad, Seong Joon Oh

TL;DR

This paper proposes the Starlight algorithm, which finds a star model of a given learning task, and validates it by showing that the star model is linearly connected with other independently found solutions.

Abstract

It has recently been conjectured that neural network solution sets reachable via stochastic gradient descent (SGD) are convex, considering permutation invariances (Entezari et al., 2022). This means that a linear path with low loss can connect two independent solutions, provided that the weights of one of the models are appropriately permuted. However, current methods to test this theory often require very wide networks to succeed. In this work, we conjecture that, more generally, the SGD solution set is a "star domain" that contains a "star model" that is linearly connected to all the other solutions via paths with low loss values, modulo permutations. We propose the Starlight algorithm, which finds a star model of a given learning task. We validate our claim by showing that this star model is linearly connected with other independently found solutions. As an additional benefit of our study, we demonstrate better uncertainty estimates from Bayesian model averaging over the obtained star domain. Further, we demonstrate star models as potential substitutes for model ensembles. Our code is available at https://github.com/aktsonthalia/starlight.
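The notion of a "linear path with low loss" can be made concrete with a small sketch. The barrier definition used here (the largest excess of the loss along the path over the linear interpolation of the two endpoint losses) is an assumption, chosen as one common convention and not taken from this page. The names `interpolate`, `loss_barrier`, and `double_well` are hypothetical, and the one-dimensional double-well loss is only a toy stand-in for a network loss.

```python
import numpy as np

def interpolate(theta_a, theta_b, t):
    """Point on the straight line between two flat weight vectors (t=0 gives theta_a)."""
    return (1.0 - t) * theta_a + t * theta_b

def loss_barrier(loss_fn, theta_a, theta_b, num_points=25):
    """Assumed definition: max over t of the path loss minus the
    linear interpolation of the two endpoint losses."""
    ts = np.linspace(0.0, 1.0, num_points)
    loss_a, loss_b = loss_fn(theta_a), loss_fn(theta_b)
    excess = [
        loss_fn(interpolate(theta_a, theta_b, t)) - ((1.0 - t) * loss_a + t * loss_b)
        for t in ts
    ]
    return float(max(excess))

def double_well(theta):
    """Toy loss with two minima, at -1 and +1 (not a real network loss)."""
    return float(np.sum((theta ** 2 - 1.0) ** 2))

def bowl(theta):
    """Toy convex loss, for comparison."""
    return float(np.sum(theta ** 2))

theta_a = np.array([-1.0])
theta_b = np.array([1.0])

barrier_double_well = loss_barrier(double_well, theta_a, theta_b)
barrier_bowl = loss_barrier(bowl, theta_a, theta_b)
print(barrier_double_well, barrier_bowl)
```

In this toy setting the two minima of the double-well loss are separated by a positive barrier, while the convex loss shows no barrier along the same path. A real experiment would additionally permute the weights of one model before interpolating, which this sketch omits.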

Paper Structure (36 sections, 11 equations, 12 figures, 6 tables, 1 algorithm)


Figures (12)

  • Figure 1: Starness of a star model vs. number of source models. We plot the loss barriers $B(\theta^\star,\theta_h)$ between star models $\theta^\star$ and heldout models $\theta_h\in H$ for different numbers of source models $|Z|$ used for learning the star model $\theta^\star$ (orange points). The heldout set is disjoint from the source models: $H\cap Z=\emptyset$. We provide a reference point given by the loss barrier between two regular solutions $B(\theta_A,\theta_B)$ for $\theta_A,\theta_B\in S$ (blue plot). The error bars indicate one standard deviation across five heldout models ($|H|=5$). Incorporating more source models (larger $|Z|$) enables finding a better star model with a lower loss barrier against an arbitrary solution.
  • Figure 2: "Starness" vs. model width and depth. For starness vs. model width (left), we vary the width of a WideResNet (depth $22$) from $1\times$ to $8\times$. For starness vs. model depth (right), we vary the depth of a WideResNet (width $1\times$) from $22$ to $40$ layers. For each depth-width combination, we plot the loss barriers $B(\theta^\star,\theta_h)$ between star models $\theta^\star$ and heldout models $\theta_h\in H$ on the y-axis. As a reference point, we plot the barrier between two regular solutions $B(\theta_A,\theta_B)$ on the x-axis. The points are annotated with the corresponding widths or depths. Star models are consistently better linearly connected with regular models than regular models are with each other.
  • Figure 3: Loss barriers for star models. We interpolate between a star model $\theta^\star$ and regular models that are trained with SGD. There are two types of regular models, depending on whether they are used for finding the star model: source models $Z$ are used, and heldout models $H$ are not. Along the interpolation, we visualize the loss barrier by plotting the loss and accuracy values (orange curves). For these curves, $t=0$ corresponds to the star model $\theta^\star$. For reference, we plot the interpolation between two arbitrary regular models (blue curves). The error bands correspond to one standard deviation.
  • Figure 4: Bayesian model averaging. The star model was trained using 50 source models. The x-axis denotes the number of models sampled from the star domain for Bayesian model averaging or from the set of source models.
  • Figure 5: Training loss landscape across SGD models, Adam models, and SGD-induced star models. We plot test loss across different types of solutions in $S$. Our star model $\theta^\star$ ("star" in the plot) is constructed from a set of SGD-trained models $Z$. We note that the star model is well-connected with SGD solutions. There remains a loss barrier between the star model and Adam solutions, but it is significantly lower than the barrier among Adam solutions.
  • ...and 7 more figures
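
The Bayesian model averaging described in the Figure 4 caption can be illustrated with a minimal sketch. The sampling scheme is an assumption, since the captions do not specify it: here a "sample from the star domain" is a uniformly random point on the line segment between the star model and a randomly chosen source model. The toy linear softmax classifier and the names `predict`, `sample_from_star_domain`, and `star_domain_average` are hypothetical, and the random weights stand in for trained models.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def predict(theta, x):
    """Toy linear classifier; theta has shape (features, classes)."""
    return softmax(x @ theta)

def sample_from_star_domain(theta_star, sources, rng):
    """Assumed scheme: random source model, then a random point on the
    segment joining it to the star model."""
    source = sources[rng.integers(len(sources))]
    t = rng.uniform(0.0, 1.0)
    return (1.0 - t) * theta_star + t * source

def star_domain_average(theta_star, sources, x, num_samples, rng):
    """Average predicted class probabilities over sampled models."""
    probs = [
        predict(sample_from_star_domain(theta_star, sources, rng), x)
        for _ in range(num_samples)
    ]
    return np.mean(probs, axis=0)

# Random stand-ins for a trained star model and trained source models.
features, classes = 4, 3
theta_star = rng.normal(size=(features, classes))
sources = [rng.normal(size=(features, classes)) for _ in range(5)]
x = rng.normal(size=(8, features))

avg_probs = star_domain_average(theta_star, sources, x, num_samples=10, rng=rng)
print(avg_probs.shape)
```

Averaging probabilities, not logits, keeps each row of the output a valid distribution, which is what uncertainty metrics computed on the averaged prediction typically expect.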

Theorems & Definitions (2)

  • Conjecture 1
  • Conjecture 2