Enabling Efficient Hybrid Systolic Computation in Shared L1-Memory Manycore Clusters
Sergio Mazzola, Samuel Riedel, Luca Benini
TL;DR
This work introduces a hybrid architecture that merges systolic dataflow with shared-L1-memory manycore clusters by using memory-mapped queues and two lightweight RISC-V extensions, Xqueue and queue-linked registers (QLRs), to form reconfigurable systolic networks on MemPool. The approach preserves general-purpose programmability while delivering the efficiency of systolic computation for regular kernels, demonstrated on a 256-core MemPool system with a 6% area overhead. Across matmul, conv2d, cfft, and dotp, the hybrid model achieves up to 73% sustained utilization and up to 208 GOPS/W in 22 nm FDX at 600 MHz, with notable reductions in synchronization overhead and interconnect power. The results show that, for high-arithmetic-intensity workloads, the hybrid architecture significantly outperforms a purely shared-memory baseline, while remaining effective for low-arithmetic-intensity kernels through pure shared-memory execution, thereby bridging two major parallelism paradigms.
Abstract
Systolic arrays and shared-L1-memory manycore clusters are commonly used architectural paradigms that offer different trade-offs to accelerate parallel workloads. While the former excel with regular dataflow at the cost of rigid architectures and complex programming models, the latter are versatile and easy to program but require explicit dataflow management and synchronization. This work aims to enable efficient systolic execution on shared-L1-memory manycore clusters. We devise a flexible architecture where small and energy-efficient RISC-V cores act as the systolic array's processing elements (PEs) and can form diverse, reconfigurable systolic topologies through queues mapped in the cluster's shared memory. We introduce two low-overhead RISC-V ISA extensions for efficient systolic execution, namely Xqueue and queue-linked registers (QLRs), which support queue management in hardware. The Xqueue extension enables single-instruction access to shared-memory-mapped queues, while QLRs allow implicit and autonomous access to them, relieving the cores of explicit communication instructions. We demonstrate Xqueue and QLRs in MemPool, an open-source shared-memory cluster with 256 PEs, and analyze the hybrid systolic-shared-memory architecture's trade-offs on several DSP kernels with diverse arithmetic intensity. For an area increase of just 6%, our hybrid architecture can double MemPool's compute unit utilization, reaching up to 73%. In typical conditions (TT/0.80V/25°C), in a 22 nm FDX technology, our hybrid architecture runs at 600 MHz with no frequency degradation and is up to 65% more energy efficient than the shared-memory baseline, achieving up to 208 GOPS/W, with up to 63% of power spent in the PEs.
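To make the idea of PEs linked by memory-mapped queues more concrete, the following is a minimal software sketch of a one-dimensional systolic chain computing a matrix-vector product. It is only an illustration under stated assumptions: the names `BoundedQueue` and `systolic_matvec`, the queue capacity, and the step-based scheduling are hypothetical, and the code models neither the Xqueue instructions nor QLR hardware behavior. Each PE keeps one matrix row, pops an input element from its upstream queue, accumulates, and forwards the element downstream.

```python
from collections import deque


class BoundedQueue:
    """Hypothetical software stand-in for a FIFO queue mapped in shared memory."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = deque()

    def can_push(self):
        return len(self.items) < self.capacity

    def can_pop(self):
        return len(self.items) > 0

    def push(self, value):
        assert self.can_push(), "push on a full queue"
        self.items.append(value)

    def pop(self):
        assert self.can_pop(), "pop on an empty queue"
        return self.items.popleft()


def systolic_matvec(matrix, vector, queue_capacity=2):
    """Compute matrix @ vector on a linear chain of PEs (assumes at least one row).

    PE i holds row i of the matrix. Vector elements stream through the chain:
    each PE pops one element, multiply-accumulates it, and forwards it.
    Returns the accumulated results and the number of simulated steps.
    """
    n_pes = len(matrix)
    n = len(vector)
    # queues[i] feeds PE i; queues[n_pes] is the sink behind the last PE.
    queues = [BoundedQueue(queue_capacity) for _ in range(n_pes + 1)]
    acc = [0] * n_pes
    consumed = [0] * n_pes  # how many vector elements each PE has processed
    fed = 0
    steps = 0

    while consumed[-1] < n:
        # The sink discards elements that have left the chain.
        if queues[n_pes].can_pop():
            queues[n_pes].pop()
        # Visit PEs from last to first so an element advances one hop per step.
        for pe in reversed(range(n_pes)):
            if queues[pe].can_pop() and queues[pe + 1].can_push():
                x = queues[pe].pop()
                acc[pe] += matrix[pe][consumed[pe]] * x
                consumed[pe] += 1
                queues[pe + 1].push(x)
        # The source injects the next vector element if there is room.
        if fed < n and queues[0].can_push():
            queues[0].push(vector[fed])
            fed += 1
        steps += 1

    return acc, steps


if __name__ == "__main__":
    result, steps = systolic_matvec([[1, 2], [3, 4]], [5, 6])
    print(result, steps)
```

In this sketch every queue access is an explicit call in the PE loop, which loosely corresponds to the explicit communication that the abstract says QLRs are meant to remove from the instruction stream.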
