Learnable Sparsification of Die-to-Die Communication via Spike-Based Encoding
Joshua Nardone, Ruijie Zhu, Joseph Callenes, Mohammed E. Elbtity, Ramtin Zand, Jason Eshraghian
TL;DR
The paper addresses bandwidth bottlenecks in large AI systems by introducing heterogeneous neural networks (HNNs), which use spike-based communication at bandwidth-constrained die boundaries while retaining dense ANN computation within each chip. The proposed software/hardware co-design uses a 2-D network-on-chip (NoC) with EMIO and a CLP converter to enable learnable sparsity for die-to-die data, combining energy-efficient spike-based communication with high-throughput dense processing. Across language and vision benchmarks, HNNs achieve up to 5.3× better energy efficiency and up to 15.2× lower latency while maintaining competitive accuracy relative to purely ANN or SNN baselines, and the gains grow with model size. By balancing sparse inter-chip signaling with dense intra-chip computation, the approach offers a practical path toward scalable, energy-efficient AI accelerators for larger models.
Abstract
Efficient communication is central to both biological and artificial intelligence (AI) systems. In biological brains, the challenge of long-range communication across regions is addressed through sparse, spike-based signaling, which minimizes energy and latency. Modern AI workloads, by contrast, are increasingly constrained by bandwidth, leading to bottlenecks that hamper scalability and efficiency. Inspired by the brain's ability to execute dynamic and complex local computations coupled with sparse inter-neuron communication, we propose heterogeneous neural networks that combine spiking neural networks (SNNs) and artificial neural networks (ANNs). Spiking layers are placed at bandwidth-limited regions, such as chip boundaries, where spike-based communication reduces data transfer overhead. Within each chip, dense ANN computations maintain high throughput, accuracy, and robustness. While SNNs have struggled to scale algorithmically, our approach surmounts this long-standing challenge through algorithm-architecture co-design: spiking layers are confined to specific partitions, so that learnable sparsity is applied to die-to-die communication. This composable design combines high ANN performance with low-bandwidth SNN efficiency. Evaluations on language processing and computer vision tasks show up to 5.3× energy efficiency gains and 15.2× latency reductions, surpassing both purely spiking and non-spiking models. As model size grows, the improvements scale accordingly. By targeting the inter-chip communication bottleneck with biologically inspired methods, this approach presents a promising path to more efficient AI systems.
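To make the idea of spike-based encoding at a die boundary concrete, the following is a minimal sketch, not the paper's converter. It assumes a generic integrate-and-fire rate code with a soft reset. The function names (`if_encode`, `rate_decode`, `spike_density`) and the parameters `threshold` and `timesteps` are hypothetical, and the threshold is fixed here, whereas in a learnable-sparsity setting it would presumably be a trained parameter.

```python
def if_encode(activations, threshold=1.0, timesteps=4):
    """Encode dense activations as binary spike trains.

    Assumption: a plain integrate-and-fire neuron with soft reset,
    used only for illustration. Returns one list of 0/1 spikes per timestep.
    """
    membrane = [0.0] * len(activations)
    spike_trains = []
    for _ in range(timesteps):
        spikes = []
        for i, a in enumerate(activations):
            membrane[i] += max(a, 0.0)       # negative inputs never spike
            if membrane[i] >= threshold:
                membrane[i] -= threshold     # soft reset keeps the residual
                spikes.append(1)
            else:
                spikes.append(0)
        spike_trains.append(spikes)
    return spike_trains


def rate_decode(spike_trains, threshold=1.0):
    """Recover approximate activations on the receiving die from spike counts."""
    timesteps = len(spike_trains)
    width = len(spike_trains[0])
    return [
        threshold * sum(step[i] for step in spike_trains) / timesteps
        for i in range(width)
    ]


def spike_density(spike_trains):
    """Fraction of (neuron, timestep) slots that carry a spike."""
    slots = sum(len(step) for step in spike_trains)
    return sum(sum(step) for step in spike_trains) / slots


if __name__ == "__main__":
    dense = [0.0, 0.25, 0.5, 1.0, -0.3]      # hypothetical boundary activations
    trains = if_encode(dense, threshold=1.0, timesteps=4)
    print("decoded:", rate_decode(trains, threshold=1.0))
    print("spike density:", spike_density(trains))
```

In this sketch only the slots that contain a spike would need to cross the die boundary, so a higher threshold trades reconstruction fidelity for fewer transmitted events. Negative and sub-threshold activations are dropped or coarsely quantized, which is the kind of loss a learned encoding would need to absorb.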
