
A Theory of Synaptic Neural Balance: From Local to Global Order

Pierre Baldi, Antonios Alexos, Ian Domingo, Alireza Rahmansetayesh

TL;DR

A general theory of synaptic neural balance is developed, and simulations show that balancing neurons prior to learning, or during learning in alternation with gradient descent steps, can improve learning speed and final performance.

Abstract

We develop a general theory of synaptic neural balance and how it can emerge or be enforced in neural networks. For a given regularizer, a neuron is said to be in balance if the total cost of its input weights is equal to the total cost of its output weights. The basic example is provided by feedforward networks of ReLU units trained with $L_2$ regularizers, which exhibit balance after proper training. The theory explains this phenomenon and extends it in several directions. The first direction is the extension to bilinear and other activation functions. The second direction is the extension to more general regularizers, including all $L_p$ regularizers. The third direction is the extension to non-layered architectures, recurrent architectures, convolutional architectures, as well as architectures with mixed activation functions. Gradient descent on the error function alone does not converge in general to a balanced state, where every neuron is in balance, even when starting from a balanced state. However, gradient descent on the regularized error function ought to converge to a balanced state, and thus network balance can be used to assess learning progress. The theory is based on two local neuronal operations: scaling which is commutative, and balancing which is not commutative. Given any initial set of weights, when local balancing operations are applied to each neuron in a stochastic manner, global order always emerges through the convergence of the stochastic balancing algorithm to the same unique set of balanced weights. The reason for this is the existence of an underlying strictly convex optimization problem where the relevant variables are constrained to a linear, only architecture-dependent, manifold. Simulations show that balancing neurons prior to learning, or during learning in alternation with gradient descent steps, can improve learning speed and final performance.
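The local balancing operation and the stochastic balancing algorithm described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes bias-free fully connected layers, ReLU hidden units, and an $L_2$ regularizer, and all function names are hypothetical. Under these assumptions, scaling the input weights of a hidden neuron by $\lambda > 0$ and its output weights by $1/\lambda$ leaves the network function unchanged, and the local cost $\lambda^2 A + B/\lambda^2$ (with $A$ and $B$ the squared norms of the input and output weights) is minimized at $\lambda = (B/A)^{1/4}$, where both terms equal $\sqrt{AB}$.

```python
import copy
import math
import random

def forward(weights, x):
    """Bias-free feedforward pass: ReLU on hidden layers, linear output layer."""
    h = list(x)
    for k, W in enumerate(weights):
        h = [sum(w * v for w, v in zip(row, h)) for row in W]
        if k < len(weights) - 1:
            h = [max(v, 0.0) for v in h]
    return h

def l2_cost(weights):
    """Total L2 regularizer: sum of all squared weights."""
    return sum(w * w for W in weights for row in W for w in row)

def in_out_costs(weights, layer, i):
    """Squared norms of the input (row i) and output (column i) weights of a hidden neuron."""
    a = sum(w * w for w in weights[layer][i])
    b = sum(row[i] ** 2 for row in weights[layer + 1])
    return a, b

def balance_neuron(weights, layer, i):
    """Apply the optimal scaling lambda = (B/A)^(1/4) to one hidden neuron, in place."""
    a, b = in_out_costs(weights, layer, i)
    if a == 0.0 or b == 0.0:
        return  # degenerate neuron: no finite positive optimum, leave unchanged
    lam = (b / a) ** 0.25
    weights[layer][i] = [lam * w for w in weights[layer][i]]
    for row in weights[layer + 1]:
        row[i] /= lam

def hidden_neurons(weights):
    return [(l, i) for l in range(len(weights) - 1) for i in range(len(weights[l]))]

def stochastic_balance(weights, steps, rng):
    """Repeatedly balance a hidden neuron chosen uniformly at random."""
    neurons = hidden_neurons(weights)
    for _ in range(steps):
        layer, i = rng.choice(neurons)
        balance_neuron(weights, layer, i)
    return weights

def max_imbalance(weights):
    """Largest |input cost - output cost| over all hidden neurons."""
    return max(abs(a - b) for a, b in
               (in_out_costs(weights, l, i) for l, i in hidden_neurons(weights)))

def max_weight_diff(w1, w2):
    return max(abs(p - q) for A, B in zip(w1, w2)
               for ra, rb in zip(A, B) for p, q in zip(ra, rb))

def random_network(sizes, rng):
    return [[[rng.gauss(0.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
            for n_in, n_out in zip(sizes, sizes[1:])]

# Same initial weights, two different random balancing orders.
init = random_network([3, 4, 4, 2], random.Random(0))
run_a = stochastic_balance(copy.deepcopy(init), 20000, random.Random(1))
run_b = stochastic_balance(copy.deepcopy(init), 20000, random.Random(2))
x = [0.5, -1.0, 2.0]

print("L2 cost before / after:", l2_cost(init), l2_cost(run_a))
print("max imbalance after:", max_imbalance(run_a))
print("max weight difference between the two runs:", max_weight_diff(run_a, run_b))
print("outputs before / after:", forward(init, x), forward(run_a, x))
```

In this sketch the two runs visit the neurons in different random orders yet should end at the same weights up to numerical precision, which is the uniqueness property the abstract attributes to the underlying strictly convex problem. The step count of 20000 is an arbitrary choice for this small network, not a value from the paper.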

Paper Structure (38 sections, 15 theorems, 45 equations, 19 figures, 7 tables)


Key Result

Proposition 2.2

The class of additively linear activation functions is exactly equal to the class of linear activation functions, i.e., activation functions of the form $f(x)=ax$.

Figures (19)

  • Figure 1: BiPU activation functions (Bi-Power-Units) as described in Equation \ref{eq:BiPU}
  • Figure 2: A path with three hidden BiLU units connecting one input unit to one output unit. During the application of the stochastic balancing algorithm, at time $t$ each unit $i$ has a cumulative scaling factor $\Lambda_i(t)$, and each directed edge from unit $j$ to unit $i$ has a scaling factor $M_{ij}(t)= \Lambda_i(t)/\Lambda_j(t)$. The $\Lambda_i(t)$ must remain within a finite closed interval away from 0 and infinity. To see this, imagine for instance that there is a subsequence of $\Lambda_3(t)$ that approaches 0. Then there must be a corresponding subsequence of $\Lambda_4(t)$ that approaches 0, or else the contribution of the weight $w_{43}\Lambda_4(t)/\Lambda_3(t)$ to the regularizer would go to infinity. But then, as we reach the output layer, the contribution of the last weight $w_{54}\Lambda_5(t)/\Lambda_4(t)$ to the regularizer goes to infinity because $\Lambda_5(t)$ is fixed to 1 and cannot compensate for the small values of $\Lambda_4(t)$. And similarly, if there is a subsequence of $\Lambda_3(t)$ going to infinity, we obtain a contradiction by propagating its effect towards the input layer.
  • Figure 3: A path with five units. After the stochastic balancing algorithm has converged, each unit $i$ has a scaling factor $\Lambda_i$, and each directed edge from unit $j$ to unit $i$ has a scaling factor $M_{ij}= \Lambda_i/\Lambda_j$. The product of the $M_{ij}$'s along the path is given by: $\frac{\Lambda_2}{\Lambda_1} \frac{\Lambda_3}{\Lambda_2} \frac{\Lambda_4}{\Lambda_3} \frac{\Lambda_5}{\Lambda_4}=\frac{\Lambda_5}{\Lambda_1}$. Accordingly, if we sum the variables $L_{ij} = \log M_{ij}$ along the directed path, we get $L_{21}+L_{32}+L_{43}+L_{54}=\log \Lambda_5 - \log \Lambda_1$. In particular, if unit 1 is an input unit and unit 5 is an output unit, we must have $\Lambda_1=\Lambda_5=1$ and thus: $L_{21}+L_{32}+L_{43}+L_{54}= 0$. Likewise, in the case of a directed cycle where unit 1 and unit 5 are the same, we must have: $L_{21}+L_{32}+L_{43}+L_{54}+ L_{15}= 0$.
  • Figure 4: Two hidden units (1 and 7) connected by two different directed paths 1-2-3-4-7 and 1-5-6-7 in a BiLU network. Each unit $i$ has a scaling factor $\Lambda_i$, and each directed edge from unit $j$ to unit $i$ has a scaling factor $M_{ij}= \Lambda_i/\Lambda_j$. The product of the $M_{ij}$'s along each path is equal to: $\frac{\Lambda_2}{\Lambda_1} \frac{\Lambda_3}{\Lambda_2} \frac{\Lambda_4}{\Lambda_3} \frac{\Lambda_7}{\Lambda_4}= \frac{\Lambda_5}{\Lambda_1} \frac{\Lambda_6}{\Lambda_5} \frac{\Lambda_7}{\Lambda_6}=\frac{\Lambda_7}{\Lambda_1}$. Therefore the variables $L_{ij}=\log M_{ij}$ must satisfy the linear equation: $L_{21}+L_{32}+L_{43}+L_{74}=L_{51}+L_{65}+L_{76}=\log \Lambda_7- \log \Lambda_1$.
  • Figure 5: Consider two paths $\alpha+\beta$ and $\gamma + \delta$ from the input layer to the output layer going through the same unit $i$. Let us assume that the first path assigns a multiplier $\Lambda_i$ to unit $i$ and the second path assigns a multiplier $\Lambda'_i$ to the same unit. By assumption we must have: $\sum_\alpha L_{ij} + \sum_\beta L_{ij}=0$ for the first path, and $\sum_\gamma L_{ij} + \sum_\delta L_{ij}=0$ for the second path. But $\alpha + \delta$ and $\gamma + \beta$ are also paths from the input layer to the output layer and therefore: $\sum_\alpha L_{ij} + \sum_\delta L_{ij}=0$ and $\sum_\gamma L_{ij} + \sum_\beta L_{ij}=0$. As a result, $\sum_\alpha L_{ij}=\log \Lambda_i=\sum_\gamma L_{ij}=\log \Lambda'_i$. Therefore the assignment of the multiplier $\Lambda_i$ must be consistent across different paths going through unit $i$.
  • ...and 14 more figures
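
The linear path constraints stated in the captions of Figures 3 and 4 can be checked numerically. The sketch below is only an illustration: the scaling factors are arbitrary random positive numbers rather than the output of the balancing algorithm, and the helper name `edge_logs` is hypothetical. It shows that the sums of $L_{ij} = \log M_{ij}$ telescope along a path, so they depend only on the scaling factors of the two endpoints.

```python
import math
import random

def edge_logs(path, Lambda):
    """L_ij = log(Lambda_i / Lambda_j) for each directed edge j -> i along the path."""
    return [math.log(Lambda[i] / Lambda[j]) for j, i in zip(path, path[1:])]

rng = random.Random(0)
# Arbitrary positive scaling factors for units 1..7 (illustrative values only).
Lambda = {u: math.exp(rng.uniform(-1.0, 1.0)) for u in range(1, 8)}

# Figure 3 setting: path 1-2-3-4-5 where unit 1 is an input and unit 5 an output,
# so both endpoint factors are fixed to 1 and the logs must sum to 0.
fig3 = dict(Lambda)
fig3[1] = fig3[5] = 1.0
sum_fig3 = sum(edge_logs([1, 2, 3, 4, 5], fig3))

# Figure 4 setting: two directed paths between hidden units 1 and 7.
sum_long = sum(edge_logs([1, 2, 3, 4, 7], Lambda))
sum_short = sum(edge_logs([1, 5, 6, 7], Lambda))
target = math.log(Lambda[7]) - math.log(Lambda[1])

print("Figure 3 path sum:", sum_fig3)
print("Figure 4 path sums and log(Lambda_7) - log(Lambda_1):", sum_long, sum_short, target)
```

Because the constraints hold for any assignment of positive factors to the units, they depend only on the architecture, which matches the abstract's description of a linear, architecture-dependent manifold.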

Theorems & Definitions (61)

  • Definition 2.1
  • Proposition 2.2
  • proof
  • Definition 2.3
  • Proposition 2.4
  • proof
  • Definition 2.5
  • Proposition 2.6
  • proof
  • Definition 2.7
  • ...and 51 more