Learnable Sparsification of Die-to-Die Communication via Spike-Based Encoding

Joshua Nardone, Ruijie Zhu, Joseph Callenes, Mohammed E. Elbtity, Ramtin Zand, Jason Eshraghian

TL;DR

The paper addresses bandwidth bottlenecks in large AI systems by introducing heterogeneous neural networks (HNNs), which use spike-based communication at bandwidth-constrained die boundaries while retaining dense ANN computation within each chip. The proposed SW/HW co-design uses a 2-D NoC with EMIO and a CLP converter to enable learnable sparsity for die-to-die data, combining energy-efficient spike-based communication with high-throughput dense processing. Across language and vision benchmarks, HNNs achieve up to 5.3× energy-efficiency gains and 15.2× latency reductions while maintaining competitive accuracy relative to purely ANN or SNN baselines, and the improvements scale as models grow. The approach offers a practical path to scalable, energy-efficient AI accelerators by balancing sparse inter-chip signaling with dense intra-chip computation.

Abstract

Efficient communication is central to both biological and artificial intelligence (AI) systems. In biological brains, the challenge of long-range communication across regions is addressed through sparse, spike-based signaling, minimizing energy and latency. Conversely, modern AI workloads are increasingly constrained by bandwidth, leading to bottlenecks that hamper scalability and efficiency. Inspired by the brain's ability to execute dynamic and complex local computations coupled with sparse inter-neuron communication, we propose heterogeneous neural networks that combine spiking neural networks (SNNs) and artificial neural networks (ANNs) at bandwidth-limited regions, such as chip boundaries, where spike-based communication reduces data transfer overhead. Within each chip, dense ANN computations maintain high throughput, accuracy, and robustness. While SNNs have struggled to algorithmically scale, our approach surmounts this long-standing challenge through algorithm-architecture co-design where learnable sparsity is employed for die-to-die communication by confining spiking layers to specific partitions. This composable design combines high ANN performance with low-bandwidth SNN efficiency. Evaluations on language processing and computer vision exhibit up to 5.3x energy efficiency gains and 15.2x latency reductions, surpassing both purely spiking and non-spiking models. As model size grows, improvements scale accordingly. By targeting the inter-chip communication bottleneck with biologically inspired methods, this approach presents a promising path to more efficient AI systems.

Paper Structure (21 sections, 10 equations, 13 figures, 4 tables)

Figures (13)

  • Figure 1: An overview. Artificial neurons are integrated with spiking neurons, where $x_i$ denotes an input, $w_i$ denotes a weight, $H(\cdot)$ is a thresholding function, and $u_t$ is the membrane potential of the spiking neuron at time $t$. Spiking neurons are positioned at the periphery of demarcation boundaries between systems (inter-chip communication is used in this paper). This enables sparse data communication over spike-based die-to-die interfaces, with dense ANN operations within cores.
  • Figure 2: 2-D Mesh NoC Hybrid Hardware Accelerator Overview. A high-level overview of the hybrid approach to solve the limitations of pure ANN or SNN accelerators. (a) Proposed Hybrid Architecture 2-D inter-die processing array. (b) Architecture overview showing the 2-D Mesh NoC SNN peripheral cores and ANN interior core grid with two unidirectional ports on each side connected through the EMIO. (c) SNN Core with proposed hybrid architecture's CLP converter. (d) ANN Core with proposed hybrid architecture's converter.
  • Figure 3: High-Level Interconnect (single port). Flow of I/O at a high level. Asynchronous FIFO buffers within the Merge and Split blocks control spike packets arriving from the 8 peripheral cores and leaving, through the SerDes block, for the 8 mapped peripheral cores on the next chip.
  • Figure 4: Cross-Layer Activation-to-Spiking and Spiking-to-Activation Packet Converter Design.
  • Figure 5: Comparative Analysis of MS-ResNet Architectures: This diagram illustrates three distinct MS-ResNet architectures, highlighting the utilization of Batch Normalization (BN) and Layer Normalization (LN). BN is predominantly employed in computer vision tasks, while LN is more suited for language modeling tasks. The architecture variation extends to the layer types; convolutional layers (Conv) are used in computer vision tasks, whereas dense layers are more common for language modeling.
  • ...and 8 more figures
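
The symbols in the Figure 1 caption ($x_i$, $w_i$, $H(\cdot)$, $u_t$) can be illustrated with a minimal Python sketch of a boundary neuron that turns dense values into a binary spike train. This is not the paper's implementation: the leaky integration, the decay factor `beta`, the fixed `threshold`, the reset-by-subtraction rule, and all function names are assumptions chosen for illustration. In a trained model the parameters controlling sparsity would be learned, whereas here they are constants.

```python
def heaviside(v):
    """Thresholding function H(v): 1 if v is non-negative, else 0."""
    return 1 if v >= 0 else 0


def weighted_sum(x, w):
    """Dense artificial-neuron pre-activation: sum over i of w_i * x_i."""
    return sum(wi * xi for wi, xi in zip(w, x))


def spike_encode(currents, threshold=1.0, beta=0.9):
    """Convert a sequence of dense values into a binary spike train.

    Assumed dynamics (hypothetical, not taken from the paper):
    leaky integration with decay `beta`, a spike when the membrane
    potential reaches `threshold`, and reset by subtraction.
    """
    u = 0.0  # membrane potential u_t
    spikes = []
    for i_t in currents:
        u = beta * u + i_t            # integrate the incoming value
        s = heaviside(u - threshold)  # emit a spike if threshold is reached
        u -= s * threshold            # subtract the threshold after a spike
        spikes.append(s)
    return spikes


def sparsity(spikes):
    """Fraction of time steps that carry no spike."""
    return 1.0 - sum(spikes) / len(spikes)


# Example: a constant input of 0.5 with no leak spikes every second step.
train = spike_encode([0.5, 0.5, 0.5, 0.5], threshold=1.0, beta=1.0)
print(train, sparsity(train))
```

Only the time steps that carry a spike would need to cross the die-to-die link, which is the sense in which a sparser spike train reduces the data transferred at the boundary.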