Enabling Efficient Hybrid Systolic Computation in Shared L1-Memory Manycore Clusters

Sergio Mazzola; Samuel Riedel; Luca Benini

TL;DR

This work introduces a hybrid architecture that merges systolic dataflow with shared-L1-memory manycore clusters by using memory-mapped queues and two lightweight RISC-V extensions, Xqueue and QLR, to form reconfigurable systolic networks on MemPool. The approach preserves general-purpose programmability while delivering the efficiency of systolic computation for regular kernels, demonstrated on a 256-core MemPool system with a modest 6% area overhead. Across matmul, conv2d, cfft, and dotp, the hybrid model achieves up to 73% sustained utilization and up to 208 GOPS/W in 22 nm FDX at 600 MHz, with notable reductions in synchronization overhead and interconnect power. The results show that, for high-arithmetic-intensity workloads, the hybrid architecture significantly outperforms a purely shared-memory baseline, while remaining effective for low-arithmetic-intensity kernels through pure shared-memory execution, thereby bridging two major parallelism paradigms.

Abstract

Systolic arrays and shared-L1-memory manycore clusters are commonly used architectural paradigms that offer different trade-offs to accelerate parallel workloads. While the first excel with regular dataflow at the cost of rigid architectures and complex programming models, the second are versatile and easy to program but require explicit dataflow management and synchronization. This work aims at enabling efficient systolic execution on shared-L1-memory manycore clusters. We devise a flexible architecture where small and energy-efficient RISC-V cores act as the systolic array's processing elements (PEs) and can form diverse, reconfigurable systolic topologies through queues mapped in the cluster's shared memory. We introduce two low-overhead RISC-V ISA extensions for efficient systolic execution, namely Xqueue and Queue-linked registers (QLRs), which support queue management in hardware. The Xqueue extension enables single-instruction access to shared-memory-mapped queues, while QLRs allow implicit and autonomous access to them, relieving the cores of explicit communication instructions. We demonstrate Xqueue and QLRs in MemPool, an open-source shared-memory cluster with 256 PEs, and analyze the hybrid systolic-shared-memory architecture's trade-offs on several DSP kernels with diverse arithmetic intensity. For an area increase of just 6%, our hybrid architecture can double MemPool's compute unit utilization, reaching up to 73%. In typical conditions (TT/0.80V/25°C), in a 22 nm FDX technology, our hybrid architecture runs at 600 MHz with no frequency degradation and is up to 65% more energy efficient than the shared-memory baseline, achieving up to 208 GOPS/W, with up to 63% of power spent in the PEs.
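To make the idea of queues mapped in shared memory more concrete, the following is a minimal behavioral sketch in Python of a software-managed FIFO connecting two PEs. It is not MemPool's implementation: the class `SoftwareQueue`, the helper `run_chain`, and the chosen capacity are hypothetical names and values introduced only for illustration. The explicit head, tail, and count updates stand in for the bookkeeping that a core would otherwise perform in software, and that the abstract says Xqueue and QLRs move into hardware.

```python
class SoftwareQueue:
    """Fixed-capacity FIFO modelling a queue stored in a shared-memory bank.

    Hypothetical model: field names and capacity are illustrative assumptions.
    """

    def __init__(self, capacity):
        self.slots = [None] * capacity  # backing storage in "shared memory"
        self.head = 0                   # index of the next element to pop
        self.tail = 0                   # index of the next free slot
        self.count = 0                  # number of elements currently stored

    def push(self, value):
        """Try to enqueue; return False when the queue is full."""
        if self.count == len(self.slots):
            return False
        self.slots[self.tail] = value
        self.tail = (self.tail + 1) % len(self.slots)
        self.count += 1
        return True

    def pop(self):
        """Try to dequeue; return (True, value), or (False, None) when empty."""
        if self.count == 0:
            return False, None
        value = self.slots[self.head]
        self.head = (self.head + 1) % len(self.slots)
        self.count -= 1
        return True, value


def run_chain(inputs, capacity=2):
    """Stream inputs from a producer PE to a consumer PE through one queue."""
    queue = SoftwareQueue(capacity)
    pending = list(inputs)
    received = []
    while pending or queue.count:
        # Producer PE: push the next input if there is room, else retry later.
        if pending and queue.push(pending[0]):
            pending.pop(0)
        # Consumer PE: pop one element if available and record it.
        ok, value = queue.pop()
        if ok:
            received.append(value)
    return received
```

In this model every transfer costs several bookkeeping steps on both sides. That per-access overhead is what motivates single-instruction queue access and, further, implicit access through registers.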

Paper Structure (31 sections, 15 figures, 3 tables)

Introduction
Background & Related Work
Systolic Array Architectures
Coarse-grained Reconfigurable Architectures
Manycore & Shared-L1-Memory Clusters
Hybrid Architecture
Hybrid Architectural View
Software Systolic Emulation
Hardware Extensions
Xqueue
Queue-linked Registers
Hybrid Systolic-Shared-Memory MemPool
Hybrid Architectural View
Xqueue
Queue-linked Registers
...and 16 more sections

Figures (15)

Figure 1: Qualitative Pareto front of the trade-off among flexibility, performance, energy efficiency, and programmability for massively parallel architectures. The two considered architectural templates are depicted with red and blue on the curve, along with the design space they covered throughout their evolution.
Figure 2: Example mapping of a purely systolic $N \times N$ 2D-mesh architecture (left) onto a generic $N^{2}$-core L1-shared cluster (right): the cluster's $N^{2}$ cores can be viewed as an $N \times N$ array of PEs connected to their neighbors through memory-mapped queues.
Figure 3: Simplified code executed on the hybrid architecture's PEs to implement a matrix multiplication. With the systolic software emulation (left), each PE accesses the queues defined by the four queue_t structures through function calls encompassing explicit queue bookkeeping and access. The Xqueue hardware extension (middle) replaces such function calls with single instructions that only require the queues' addresses. With the further QLR extension (right), after a minimal set-up phase, communication is performed implicitly and queue instructions are elided entirely.
Figure 4: Simplified overview of the MemPool tile (left) and of the Xqueue-extended memory controller (right). In the hybrid view of MemPool, each core is a PE of the systolic array, and each bank reserves space for a memory-mapped queue handled by its queue manager.
Figure 5: Schematic of the Snitch core extended with four QLRs. The QLRs tap into the register file ports and the instruction decoding logic to detect register accesses and intercept newly written data. They also have access to the scoreboard, to the register file write-back path, and to the .
...and 10 more figures
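The mapping described in the Figure 2 caption, where the $N^{2}$ cores of a cluster are viewed as an $N \times N$ mesh, can be sketched as a simple index computation. The snippet below is an illustrative assumption rather than the paper's scheme: it assumes a row-major assignment of core IDs to mesh positions, and the function names `mesh_coords` and `mesh_neighbors` are hypothetical.

```python
def mesh_coords(core_id, n):
    """Return the (row, col) of a core in an n x n mesh, assuming row-major IDs."""
    if not 0 <= core_id < n * n:
        raise ValueError("core_id out of range for an n x n mesh")
    return divmod(core_id, n)


def mesh_neighbors(core_id, n):
    """Return the core IDs of the four mesh neighbors (None at the mesh border).

    Each non-None neighbor is a core this PE would share a queue with.
    """
    row, col = mesh_coords(core_id, n)
    return {
        "north": core_id - n if row > 0 else None,
        "south": core_id + n if row < n - 1 else None,
        "west": core_id - 1 if col > 0 else None,
        "east": core_id + 1 if col < n - 1 else None,
    }
```

For a 256-core cluster this corresponds to `n = 16`. Because the topology is defined only by which queues each core reads and writes, other arrangements (for example, chains or rings) would amount to a different neighbor function over the same cores.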
