Generalization emerges from local optimization in a self-organized learning network

S. Barland, L. Gil

TL;DR

This work designs and analyzes a new paradigm for building supervised learning networks, driven only by local optimization rules without relying on a global error function, and makes it possible to rethink the grokking transition in a new light.

Abstract

We design and analyze a new paradigm for building supervised learning networks, driven only by local optimization rules without relying on a global error function. Traditional neural networks with a fixed topology are made up of identical nodes and derive their expressiveness from an appropriate adjustment of connection weights. In contrast, our network stores new knowledge in the nodes accurately and instantaneously, in the form of a lookup table. Only then is some of this information structured and incorporated into the network geometry. The training error is initially zero by construction and remains so throughout the network topology transformation phase. The latter involves a small number of local topological transformations, such as splitting or merging of nodes and adding binary connections between them. The choice of operations to be carried out is driven only by optimization of expressivity at the local scale. What we are primarily looking for in a learning network is its ability to generalize, i.e., its capacity to correctly answer questions for which it has never learned the answers. We show on numerous examples of classification tasks that the networks generated by our algorithm systematically reach such a state of perfect generalization when the number of learned examples becomes sufficiently large. We report on the dynamics of the change of state and show that it is abrupt and has the distinctive characteristics of a first-order phase transition, a phenomenon already observed for traditional learning networks and known as grokking. In addition to proposing a non-potential approach for the construction of learning networks, our algorithm makes it possible to rethink the grokking transition in a new light, under which acquisition of training data and topological structuring of data are completely decoupled phenomena.
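The idea that a node stores training data exactly in a lookup table, so that the training error is zero by construction, can be illustrated with a minimal sketch. The class `LutNode`, its methods, the choice of the 3-bit parity task, and the particular five training samples are all assumptions made for illustration; they are not taken from the authors' implementation.

```python
from itertools import product


class LutNode:
    """Hypothetical node that memorizes input-code -> output-code pairs."""

    def __init__(self):
        self.lut = {}

    def learn(self, code, output):
        # Storage is exact and immediate: one table entry per training sample.
        self.lut[tuple(code)] = output

    def answer(self, code):
        # Returns None for an input code that was never learned.
        return self.lut.get(tuple(code))


def parity(bits):
    """Target function of the assumed task: 1 if the number of ones is odd."""
    return sum(bits) % 2


# Five assumed training samples out of the 2**3 = 8 possible inputs.
training = [(0, 0, 1), (1, 0, 0), (0, 1, 1), (1, 1, 1), (0, 0, 0)]

node = LutNode()
for bits in training:
    node.learn(bits, parity(bits))

# Every learned sample is answered correctly, by construction.
training_errors = sum(node.answer(b) != parity(b) for b in training)

# Inputs that were never learned have no entry: the bare table cannot generalize.
unknown = [b for b in product((0, 1), repeat=3) if node.answer(b) is None]

print(training_errors, len(unknown))
```

In this sketch the table alone gives no answer for the three unseen inputs, which is the gap that the topological transformations described in the abstract are meant to close.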

Paper Structure (23 sections, 3 equations, 9 figures, 4 tables)

This paper contains 23 sections, 3 equations, 9 figures, 4 tables.

Figures (9)

  • Figure 1: Network before (a) and after (b) topological evolution for the $N$-bit parity problem with $N=3$. There are $4$ interface nodes ($3$ input, $1$ output), with $1$ hidden node in (a) and $2$ in (b). There are $5$ TDS, each associated with a line of the various tables involved. The notation $i_{A}^{p}$ designates the $p$-th incoming connection to $A$, and $o_{A}^{q}$ its $q$-th outgoing connection. As an example of how to read figure (a), consider the first (resp. second) training sample: the top left node sends $0$ (resp. $1$), the middle left node $0$ (resp. $0$), and the bottom left node $1$ (resp. $0$). The central node then receives the code $0,0,1$ (resp. $1,0,0$). According to the node's LUT, the corresponding output code is $1$ (resp. $1$), which is retrieved by the last node. Figure (b) reads the same way, from left to right, layer by layer. Note that networks (a) and (b) are strictly equivalent on all $5$ training samples, but network (b) also answers correctly for all $2^N$ cases, including the three that are not part of the training samples (cf. text).
  • Figure 2: Network configuration before (top) and after (bottom) the merging $(A,B) \longrightarrow C$. For node $A$, the numbers $p$ on the left and $q$ on the right, overlapping the connections, label the incoming and outgoing links, denoted $i_{A}^{p}$ and $o_{A}^{q}$ respectively in the text. For the sake of clarity, a simplified notation is used here. Note also that not all pairs of nodes can be merged; this is the case, for example, of nodes $B$ and $E$ in the top figure.
  • Figure 3: Schematic representation of the splitting procedure $C \longrightarrow (A,B)$. Diagram (a) shows the initial node $C$ with its 6 input ($i_{C}^{p}, p \in [1,6]$) and 4 output ($o_{C}^{q}, q \in [1,4]$) connections. Diagram (b) displays the configuration after the splitting procedure. Note the extra link between $A$ and $B$ on the blue background, labeled $i_{B}^{1}$ or $o_{A}^{4}$ depending on whether it is considered an incoming or an outgoing connection. This link carries the extra bits shown on the blue background in Tab. \ref{tab2}.
  • Figure 4: Table of input codes for nodes $A$ and $B$ after learning $m=5$ training samples. (a): Numbers from $1$ to $5$ correspond to the projection onto $A$ and $B$ of the $C$ input codes for the $5$ training samples. Because of entries $1$ and $4$ (in red in (a)), the LUT of $B$ does not define a function (a single-valued mapping). We correct this by introducing an additional link $o_{A}^{4}=i_{B}^{1}$ (the elements involved are shown on a gray background in (b)). Numbers from $6$ to $13$ then correspond to consistent (although perhaps incorrect) responses from the pair ($A,B$) for input codes unknown to $C$.
  • Figure 5: A typical network configuration obtained after network topology evolution (a). The inset (b) shows the initial configuration. Node $A$: the dotted red arrows represent the additional connections created by node $A$ on receiving an unknown CIC. Nodes $B$, $C$ and $D$: examples of configurations to be removed during the cleaning operation.
  • ...and 4 more figures
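
The consistency issue described in the caption of Figure 4 (a lookup table that maps one input code to two different outputs, repaired by an additional link carrying an extra bit) can be sketched as follows. The function `conflicts`, the sample codes, and the choice of extra bit are hypothetical and chosen only for illustration; how the authors actually select the extra bits is not stated in this excerpt.

```python
from collections import defaultdict


def conflicts(samples):
    """Return the input codes that are mapped to more than one output code."""
    seen = defaultdict(set)
    for code, out in samples:
        seen[code].add(out)
    return {code: outs for code, outs in seen.items() if len(outs) > 1}


# Hypothetical projection of five training samples onto a node B,
# written as (input code, output code). Samples 1 and 4 share an input
# code but require different outputs, so the table is not single-valued.
b_samples = [
    ((0, 1), (0,)),  # sample 1
    ((1, 0), (1,)),  # sample 2
    ((1, 1), (0,)),  # sample 3
    ((0, 1), (1,)),  # sample 4
    ((0, 0), (1,)),  # sample 5
]
before = conflicts(b_samples)

# Assumed repair: a new link from A to B carries one extra bit per sample,
# chosen here so that samples 1 and 4 receive different input codes.
extra_bit = [0, 0, 0, 1, 0]
b_fixed = [((e,) + code, out) for e, (code, out) in zip(extra_bit, b_samples)]
after = conflicts(b_fixed)

print(before)
print(after)
```

With the extra bit prepended to each input code, the five entries become distinct and the table defines a function again.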