SISA: A Scale-In Systolic Array for GEMM Acceleration

Luigi Altamura, Alessio Cicero, Mateo Vázquez Maceiras, Mohammad Ali Maleki, Pedro Trancoso

Abstract

The currently dominant AI/ML workloads, such as Large Language Models (LLMs), rely on the efficient execution of General Matrix-Matrix Multiplication (GEMM) operations. Thus, most systems are equipped with dedicated matrix hardware accelerators based on square Systolic Arrays (SAs) of Processing Elements (PEs). While this organization was effective for traditional Deep Neural Networks (DNNs), LLMs introduce input-dependent and highly skewed matrices, leading to underutilized SA resources. To address this challenge, we propose SISA (Scale-In Systolic Array), a novel SA architecture that partitions the traditional square array into horizontal rectangular slabs. With minimal overhead, SISA exposes parallelism through independently scheduled slabs for efficient execution of small or skewed matrix shapes, while retaining full-array operation for large GEMMs. SISA achieves up to 8.52x speedup and 93% energy-delay-product (EDP) reduction for representative LLMs compared to a state-of-the-art monolithic SA with the same number of PEs.
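
The abstract describes three execution regimes selected by the GEMM shape: independent slabs for small-$M$ matrices, slab fusion for intermediate $M$, and monolithic full-array operation for large $M$. As a minimal sketch of how such a mode decision might look, assuming a hypothetical dispatcher, slab geometry, and names that are not taken from the paper:

```python
# Minimal sketch (not from the paper): choosing a SISA-style execution mode
# from the GEMM's M dimension. All names and thresholds are hypothetical.

from dataclasses import dataclass

@dataclass
class SAConfig:
    rows: int       # PE rows in the full square array
    num_slabs: int  # horizontal slabs the array is partitioned into

    @property
    def slab_rows(self) -> int:
        return self.rows // self.num_slabs

def select_mode(M: int, cfg: SAConfig) -> str:
    """Pick an execution mode based on how many PE rows the GEMM can fill."""
    if M <= cfg.slab_rows:
        # Small-M: each slab runs an independent tile, exposing
        # parallelism along the N dimension.
        return "independent-slabs"
    if M <= cfg.rows:
        # Intermediate-M: fuse adjacent slabs into a taller logical array
        # that matches the tile height.
        return "slab-fusion"
    # Large-M: fall back to monolithic full-array operation.
    return "monolithic"

# Example: a 128x128 array split into 4 slabs of 32 rows each.
cfg = SAConfig(rows=128, num_slabs=4)
print(select_mode(16, cfg))   # independent-slabs
print(select_mode(96, cfg))   # slab-fusion
print(select_mode(512, cfg))  # monolithic
```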

Paper Structure

This paper contains 23 sections, 7 figures, and 3 tables.

Figures (7)

  • Figure 1: (a) Prompt length derived from interactive chatbot requests and (b) alternative SA designs.
  • Figure 2: Overview of the SISA architecture with detailed views of the memory hierarchy and slab fusion mechanism.
  • Figure 3: Tiling and execution strategies in SISA for different GEMM shapes. (a) Independent slab execution for small-$M$ GEMMs, distributing tiles across slabs along $N$. (b) Slab fusion for intermediate-$M$ GEMMs, forming larger logical SA matching tile height. (c) Monolithic execution for large-$M$ GEMMs using the full array. (d) Slab-level power-gating disables slabs when parallelism is limited. (See the tiling sketch after this list.)
  • Figure 4: Speedup of SISA compared to TPU. Higher is better.
  • Figure 5: Normalized EDP of SISA compared to TPU. Lower is better.
  • ...and 2 more figures
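
To illustrate the independent-slab mode of Figure 3(a), here is a minimal Python sketch that distributes output tiles of a small-$M$ GEMM across slabs along the $N$ dimension. The round-robin policy, function names, and tile sizes are hypothetical illustrations; the paper's actual scheduling policy is not reproduced here.

```python
# Minimal sketch (hypothetical, not the paper's scheduler): round-robin
# distribution of output tiles across slabs along the N dimension for a
# small-M GEMM, as in the independent-slab mode of Figure 3(a).

def distribute_tiles(N: int, tile_n: int, num_slabs: int) -> dict[int, list[tuple[int, int]]]:
    """Assign [n_start, n_end) column ranges of C = A @ B to slabs."""
    assignment: dict[int, list[tuple[int, int]]] = {s: [] for s in range(num_slabs)}
    for t, n_start in enumerate(range(0, N, tile_n)):
        slab = t % num_slabs  # round-robin over independent slabs
        assignment[slab].append((n_start, min(n_start + tile_n, N)))
    return assignment

# Example: N = 1024 split into 128-wide tiles over 4 slabs.
for slab, ranges in distribute_tiles(1024, 128, 4).items():
    print(f"slab {slab}: {ranges}")
```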