Brain network science modelling of sparse neural networks enables Transformers and LLMs to perform as fully connected

Yingtao Zhang, Diego Cerretti, Jialin Zhao, Wenjing Wu, Ziheng Liao, Umberto Michieli, Carlo Vittorio Cannistraci

TL;DR

The paper tackles the high cost of training large neural networks by developing brain-inspired dynamic sparse training (DST) techniques. It introduces a BRF-based sparse initialization, a GPU-friendly node-based CH2-L3n link predictor, and a soft Cannistraci-Hebb training rule (CHTs), optionally combined with a sigmoid density decay (CHTss), to balance exploration and exploitation. The resulting ultra-sparse networks outperform fully connected baselines in MLPs at ~1% connectivity and in Transformer-based machine translation at 5% connectivity, and outperform other DST methods in language modeling at 30% connectivity. These contributions (a matrix-based CH predictor that reduces the complexity of CH regrowth from $O(Nd^3)$ to $O(N^3)$, and a brain-inspired topology initialization) offer a practical path to efficient, scalable sparse neural networks in both vision and language tasks.

Abstract

Dynamic sparse training (DST) can reduce the computational demands of ANNs, but struggles to maintain peak performance at high sparsity levels. Cannistraci-Hebb training (CHT) is a brain-inspired method for growing connectivity in DST. CHT leverages gradient-free, topology-driven link regrowth, which has shown an ultra-sparse (less than 1% connectivity) advantage over fully connected networks across various tasks. Yet CHT suffers from two main drawbacks: (i) its time complexity is $O(Nd^3)$, where $N$ is the network size (number of nodes) and $d$ is the node degree, which restricts it to ultra-sparse regimes; (ii) it selects the top link prediction scores, which is inappropriate for the early training epochs, when the network presents unreliable connections. Here, we design the first brain-inspired network model, termed bipartite receptive field (BRF), to initialize the connectivity of sparse artificial neural networks. We further introduce a GPU-friendly matrix-based approximation of CH link prediction, reducing the complexity to $O(N^3)$. We introduce the Cannistraci-Hebb training soft rule (CHTs), which adopts a flexible strategy for sampling connections in both link removal and regrowth, balancing the exploration and exploitation of network topology. Additionally, we integrate CHTs with a sigmoid gradual density decay (CHTss). Empirical results show that BRF offers performance advantages over previous network science models. Using 1% of the connections, CHTs outperforms fully connected networks in MLP architectures on image classification tasks, compressing some networks to less than 30% of the nodes. Using 5% of the connections, CHTss outperforms fully connected networks in two Transformer-based machine translation tasks. Finally, at 30% connectivity, both CHTs and CHTss outperform other DST methods in language modeling.
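
To make the contrast between top-score selection and soft sampling concrete, the following is a minimal sketch, not the paper's implementation. It assumes that "soft" means sampling links without replacement with probability increasing in their score (implemented here with Efraimidis-Spirakis keys), and that the density schedule is a plain logistic curve between an initial and a final density. The names `top_k_links`, `soft_sample_links`, `sigmoid_density`, `temperature` and `steepness` are hypothetical; the exact CHTs sampling rule and CHTss schedule are not reproduced in this summary.

```python
import math
import random


def top_k_links(scores, k):
    """Deterministic rule: keep the k candidate links with the highest scores."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


def soft_sample_links(scores, k, temperature=1.0, rng=random):
    """Sample k links without replacement, favouring higher scores.

    Assumption: weights are score ** (1 / temperature). A large temperature
    flattens the weights (more exploration); a small one approaches top-k
    (more exploitation). Keys log(u) / w are the log form of the
    Efraimidis-Spirakis keys u ** (1 / w).
    """
    keyed = []
    for link, score in scores.items():
        weight = max(score, 1e-12) ** (1.0 / temperature)
        u = 1.0 - rng.random()  # uniform in (0, 1], so log(u) is defined
        keyed.append((math.log(u) / weight, link))
    keyed.sort(reverse=True)  # keys closest to zero win
    return [link for _, link in keyed[:k]]


def sigmoid_density(step, total_steps, d_init, d_final, steepness=10.0):
    """Assumed logistic schedule: density falls from about d_init to about d_final."""
    x = steepness * (step / total_steps - 0.5)
    return d_final + (d_init - d_final) / (1.0 + math.exp(x))
```

For example, `soft_sample_links(scores, k, temperature=5.0)` early in training and a lower temperature later would shift the selection from exploration towards exploitation, while `sigmoid_density(step, total_steps, 1.0, 0.3)` gives the target fraction of active links at a given step under the assumed schedule.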

Paper Structure

This paper contains 55 sections, 16 equations, 12 figures, 20 tables, 1 algorithm.

Figures (12)

  • Figure 1: Illustration of the CHTs process. One training iteration follows the steps of (a1) $\rightarrow$ (b1) $\rightarrow$ (c1) $\rightarrow$ (c2) $\rightarrow$ (d1) $\rightarrow$ (e). (a1) Network initialization with each of the sandwich layers (bipartite networks connecting layers' input nodes to their output nodes) being a bipartite receptive field (BRF) network. (a2) BRF network representation with $r$ = 0. (b1) Link removal process. (b2) Formula for determining which links to remove. (c1) Removal of inactive neurons caused by link removal. (c2) Adjust and remove incomplete links caused by inactive neuron removal. (d1) Regrowth of links according to the CH2-L3 node-based soft rule. (d2) Detailed illustration of the CH2-L3 node-based soft rule. (e) Finished state of the network after one iteration. The next iteration repeats the steps (b1) - (e) from this finished state. $\tilde{A}$ indicates the removal set of the iteration and $A^*$ is the regrown set.
  • Figure 2: One-time Link Prediction Runtime Performance Evaluation of node-based and path-based methods across varying densities and network sizes. In (a), the network size is fixed at 1024 × 1024, while in (b), the density is fixed at 5%.
  • Figure 3: Comparison of link regrowth strategies in CHTs using a LLaMA-60M model trained on OpenWebText for 5000 steps. The left plot shows validation perplexity (lower is better), while the right plot reports the in-time over-parameterization (ITOP) rate, which measures the cumulative proportion of links activated during training. Results are presented for three strategies: Soft, Random, and Deterministic regrowth.
  • Figure 4: Cannistraci-Hebb epitopological rationale. The figure illustrates an explanatory example of topological link prediction using the Cannistraci-Hebb epitopological rationale based on either L2 or L3 paths. The two black nodes represent the seed nodes whose unobserved interaction is to be assigned a likelihood score. White nodes denote the common neighbours (CNs) of the seed nodes at either L2 or L3 distance. Together, the set of CNs and the internal local community links (iLCLs) constitute the local community. Different link types are color-coded: green for nLCLs, red for external local community links (eLCLs), and white for iLCLs. The L2 (path length 2) and L3 (path length 3) paths associated with the illustrated communities are highlighted. Notably, in artificial neural networks (ANNs), linear layers correspond to bipartite networks, which inherently support only L3 path predictions, as shown in Figure \ref{fig:principle}.
  • Figure 5: The adjacency matrix of the Bipartite Scale-Free (BSF) network model compared to those of the Bipartite Small-World (BSW) network, the Bipartite Receptive Field with fixed sampling (BRF$_f$), and the Bipartite Receptive Field with uniform sampling (BRF$_u$) as parameters $\beta$ and $r$ vary between 0 and 1. a) The BSF model inherently forms a scale-free network characterized by a power-law distribution with $\gamma = 2.76$. b) As $\beta$ changes from 0 to 1, the network exhibits reduced clustering. It is important to note that when $\beta = 0$, the BSW model does not qualify as a small-world network. c) As $r$ increases towards $1$, the adjacency matrix becomes more random, with the output neurons' degrees sampled from either a fixed (BRF$_f$) or a uniform (BRF$_u$) distribution.
  • ...and 7 more figures
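
Two quantities mentioned in the captions above can be illustrated with a short sketch under stated assumptions. The first function counts length-3 walks between input and output nodes of a bipartite layer as the matrix product of the mask, its transpose, and the mask again; for a candidate link that is currently missing, every such walk is a genuine L3 path. This is plain path counting, not the CH2-L3 score itself, which is not specified in this summary. The second function computes the ITOP rate as the fraction of all possible links that were active in at least one recorded mask. The names `l3_walk_counts`, `itop_rate`, `mask` and `mask_history` are hypothetical.

```python
def l3_walk_counts(mask):
    """Count length-3 walks between every (input, output) pair of a bipartite layer.

    mask[i][j] is 1 if input node i is linked to output node j, else 0.
    The result equals mask @ mask^T @ mask. For pairs with mask[i][j] == 0
    the walks cannot backtrack, so the count is the number of L3 paths.
    """
    n_in, n_out = len(mask), len(mask[0])
    # shared[i][k]: number of output nodes linked to both input i and input k
    shared = [[sum(mask[i][j] * mask[k][j] for j in range(n_out))
               for k in range(n_in)] for i in range(n_in)]
    return [[sum(shared[i][k] * mask[k][j] for k in range(n_in))
             for j in range(n_out)] for i in range(n_in)]


def itop_rate(mask_history):
    """Fraction of all possible links active in at least one recorded mask."""
    n_in, n_out = len(mask_history[0]), len(mask_history[0][0])
    ever_active = {(i, j) for mask in mask_history
                   for i in range(n_in) for j in range(n_out) if mask[i][j]}
    return len(ever_active) / (n_in * n_out)
```

In a tensor library the first function reduces to two dense matrix multiplications, which is the kind of operation that maps well to GPUs; the pure-Python loops here are only meant to keep the sketch dependency-free.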