Towards 3D Acceleration for low-power Mixture-of-Experts and Multi-Head Attention Spiking Transformers

Boxun Xu; Junyoung Hwang; Pruek Vanna-iampikul; Yuxuan Yin; Sung Kyu Lim; Peng Li

TL;DR

The paper tackles the energy and latency challenges of large-scale spiking transformers that combine Mixture-of-Experts (MoE) and multi-head self-attention (MHA) by introducing the first dedicated 3D hardware accelerators for them. It presents two-tier, face-to-face bonded 3D architectures with memory-on-logic and logic-on-logic interconnections that exploit spatial and temporal parallelism, enabling efficient weight reuse across modular spiking experts and attention heads. Workload-specific definitions for Spiking MoE and Spiking MHA are provided, along with kernel-fused, dataflow-optimized processing element (PE) designs and top-tier global buffers that minimize data movement. Evaluations on CIFAR-10/100 show the 3D designs achieving area reductions of up to 39–41%, power savings of 14.4% (MoE), and memory-access latency reductions of 15–30% relative to 2D designs, together with higher effective frequencies. Overall, the work offers a blueprint for scalable, low-power neuromorphic accelerators that handle brain-inspired MoE and MHA computation in large-scale spiking transformers.

Abstract

Spiking Neural Networks (SNNs) provide a brain-inspired, event-driven mechanism that is believed to be critical to unlocking energy-efficient deep learning. The mixture-of-experts approach mirrors the parallel distributed processing of nervous systems, introducing conditional computation policies and expanding model capacity without scaling up the number of computational operations. Additionally, spiking mixture-of-experts self-attention mechanisms enhance representation capacity, effectively capturing diverse patterns of entities and dependencies between visual or linguistic tokens. However, hardware support is currently lacking for the highly parallel distributed processing needed by spiking transformers, which embody brain-inspired computation. This paper introduces the first 3D hardware architecture and design methodology for Mixture-of-Experts and Multi-Head Attention spiking transformers. By leveraging 3D integration with memory-on-logic and logic-on-logic stacking, we explore such brain-inspired accelerators with spatially stackable circuitry, demonstrating significant optimization of energy efficiency and latency compared to conventional 2D CMOS integration.
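To make the conditional-computation claim concrete, the following minimal sketch routes a binary spike vector to the top-k of several experts and counts the accumulate operations performed by the selected experts. It is an illustration only, not the paper's workload definition: the function names, the non-leaky integrate-and-fire threshold, the router scoring, and the OR-merge of expert outputs are all assumptions made for this example.

```python
import random

def make_layer(rng, n_out, n_in):
    """Hypothetical helper: random weight matrix with n_out rows and n_in columns."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]

def route_top_k(spikes, router_weights, k):
    """Score each expert by accumulating router weights at active inputs; keep the top k."""
    scores = [sum(w for w, s in zip(row, spikes) if s) for row in router_weights]
    ranked = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)
    return ranked[:k]

def expert_forward(spikes, weights, threshold):
    """One timestep of a non-leaky integrate-and-fire layer (assumed neuron model).

    Inputs are binary, so each synapse needs an accumulate only when its
    input spiked; no multiplication is required.
    """
    out, accumulates = [], 0
    for row in weights:
        potential = 0.0
        for w, s in zip(row, spikes):
            if s:
                potential += w
                accumulates += 1
        out.append(1 if potential >= threshold else 0)
    return out, accumulates

def spiking_moe_step(spikes, router_weights, experts, k, threshold=1.0):
    """Run only the k routed experts and merge their spikes with OR (an assumption)."""
    chosen = route_top_k(spikes, router_weights, k)
    combined = [0] * len(experts[0])
    total_ops = 0
    for e in chosen:
        out, ops = expert_forward(spikes, experts[e], threshold)
        combined = [a | b for a, b in zip(combined, out)]
        total_ops += ops
    return chosen, combined, total_ops

if __name__ == "__main__":
    rng = random.Random(0)
    n_in, n_out, k = 16, 8, 2
    spikes = [1 if rng.random() < 0.25 else 0 for _ in range(n_in)]
    for n_experts in (4, 16):
        router = make_layer(rng, n_experts, n_in)
        experts = [make_layer(rng, n_out, n_in) for _ in range(n_experts)]
        chosen, out, ops = spiking_moe_step(spikes, router, experts, k)
        print(n_experts, "experts -> routed to", chosen, "expert accumulates:", ops)
```

In this sketch the expert-side accumulate count equals k times the number of output neurons times the number of active input spikes, so it stays the same whether 4 or 16 experts are available. Only the small router cost grows with the expert count, which is the sense in which capacity expands without scaling up the number of operations.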
