DeMM: A Decoupled Matrix Multiplication Engine Supporting Relaxed Structured Sparsity

Christodoulos Peltekis, Vasileios Titopoulos, Chrysostomos Nicopoulos, Giorgos Dimitrakopoulos

TL;DR

DeMM introduces a decoupled matrix-multiplication engine that natively supports relaxed structured sparsity patterns in sparse–dense matrix products. By disaggregating memory from MAC units and pre-loading matrix $B$ into a multi-read-port memory, DeMM processes rows of a relaxed-sparse $A$ using a row-wise, col_idx-driven data flow, with reconfigurability to denser $kN$:$M$ patterns. Empirical evaluations on ResNet50 and related CNNs show substantial latency and energy advantages over state-of-the-art engines (VEGETA, S2TA, SPOTS) across relaxed sparsity and fine-grained sparsity regimes, while maintaining comparable area and significantly reducing power. The results indicate practical benefits for mobile/deployed DL workloads requiring flexible sparsity handling without the complexity of traditional systolic arrays.

Abstract

Deep Learning (DL) has achieved unprecedented success in various application domains. Meanwhile, model pruning has emerged as a viable solution to reduce the footprint of DL models in mobile applications, without compromising their accuracy. To enable the matrix engines built for dense DL models to also handle their pruned counterparts, pruned DL models follow a fine-grained structured sparsity pattern of 1:4, or 2:4, whereby in each group of four contiguous values, at most one, or two, respectively, may be non-zero. Structured sparsity has recently also moved to coarser (relaxed) cases of N:128, or N:256, for small values of N, targeting a wider range of sparsity (10%-90%) for the DL models. In this work, we design an accelerator that operates, by construction, on wide blocks with relaxed structured sparsity. In contrast to the conventional systolic array archetype, the new engine decouples the memory part of the systolic array from the multiply-add units. The memory block comprises 1 write and N read ports, with the number of read ports being equal to the maximum number of non-zero elements per row. The multiply-add units connect directly to each read port and complete the multiplication in a row-wise product-first order. More importantly, simple reconfiguration supports denser patterns. The experimental evaluation demonstrates substantial latency improvements over current state-of-the-art systolic array engines built for fine-grained and relaxed structured sparsity.
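
The row-wise, product-first order described in the abstract can be illustrated with a small software model. This is only a sketch under stated assumptions, not the DeMM hardware: the names `multiply_row`, `sparse_dense_matmul`, and `n_ports` are hypothetical, each sparse row of A is assumed to be stored as packed `(value, col_idx)` pairs, and the loop over non-zeros stands in for read ports that would operate in parallel in hardware.

```python
from typing import List, Tuple

# Assumed packed form of one sparse row of A: (value, col_idx) pairs.
PackedRow = List[Tuple[float, int]]


def multiply_row(packed_row: PackedRow, B: List[List[float]], n_ports: int) -> List[float]:
    """Compute one output row of A @ B from the non-zeros of one row of A."""
    if len(packed_row) > n_ports:
        raise ValueError("row has more non-zeros than available read ports")
    n_cols = len(B[0])
    out = [0.0] * n_cols
    for value, col_idx in packed_row:   # one (modelled) read port per non-zero
        b_row = B[col_idx]              # the port reads the row of B selected by col_idx
        for j in range(n_cols):         # one multiplier per output column
            out[j] += value * b_row[j]  # add-reduction across ports
    return out


def sparse_dense_matmul(packed_A: List[PackedRow], B: List[List[float]], n_ports: int) -> List[List[float]]:
    """Process the rows of A one after another, as in a row-wise data flow."""
    return [multiply_row(row, B, n_ports) for row in packed_A]


# A has at most two non-zeros per row, so two read ports suffice.
packed_A = [[(2.0, 1)], [(3.0, 3)], [(1.0, 0), (4.0, 3)]]
B = [[1, 2], [3, 4], [5, 6], [7, 8]]
result = sparse_dense_matmul(packed_A, B, n_ports=2)
```

In this model, the zero elements of A never trigger a memory read or a multiplication; only the packed non-zeros do.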

Paper Structure (9 sections, 1 equation, 8 figures)

Figures (8)

  • Figure 1: Examples of (a) unstructured sparsity; (b) structured block sparsity of 1:4 (i.e., up to 1 non-zero element in every 4 consecutive elements); and (c) relaxed structured sparsity 4:16, and the corresponding packed representation of the non-zero elements. A blue square indicates a non-zero element.
  • Figure 2: Read and multiply operations are sufficient to perform matrix multiplication when the sparse matrix contains at most one non-zero element per row.
  • Figure 3: Multiplying a sparse matrix with at most two non-zero elements per row requires two separate memory read ports and two rows of multipliers. The products of each port are then independently added in parallel to form the final result of the output row.
  • Figure 4: The overall organization of a DeMM engine that supports relaxed structured sparsity of 4:64, using a memory block with four read ports, plus four multipliers and one add-reduction unit per output element. The example assumes that 64 outputs (columns) are computed in parallel.
  • Figure 5: The overall architecture of the DeMM engine that supports an $N$:$M$ relaxed structured sparsity and can be reconfigured for all $kN$:$M$ denser variants.
  • ...and 3 more figures
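
The packed representation mentioned for Figure 1 and the $kN$:$M$ reconfiguration mentioned for Figure 5 can be sketched in a few lines. The code below is illustrative only: `pack_row` and `split_into_passes` are hypothetical helpers, `col_idx` is assumed to be local to each block of M elements, and serving a denser block in several passes over the same read ports is an assumption about how reconfiguration could work, not a description of the DeMM design.

```python
from typing import List, Tuple

# Assumed packed form of one block: (value, block-local col_idx) pairs.
PackedBlock = List[Tuple[float, int]]


def pack_row(row: List[float], n: int, m: int) -> List[PackedBlock]:
    """Pack a dense row into blocks of M elements, allowing at most N non-zeros each."""
    if len(row) % m != 0:
        raise ValueError("row length must be a multiple of M")
    blocks = []
    for start in range(0, len(row), m):
        block = row[start:start + m]
        non_zeros = [(v, i) for i, v in enumerate(block) if v != 0]
        if len(non_zeros) > n:
            raise ValueError(f"block at offset {start} violates {n}:{m} sparsity")
        blocks.append(non_zeros)
    return blocks


def split_into_passes(block: PackedBlock, n_ports: int) -> List[PackedBlock]:
    """Split a denser block into passes that each fit the available read ports."""
    return [block[i:i + n_ports] for i in range(0, len(block), n_ports)]


# One row that satisfies 4:16 relaxed structured sparsity.
row = [0, 5, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0, 1, 0, 0, 2]
blocks = pack_row(row, n=4, m=16)
# The same block handled by a hypothetical two-port engine in two passes (k = 2).
passes = split_into_passes(blocks[0], n_ports=2)
```

Under these assumptions, the partial results of the passes would be accumulated into the same output row, so a denser pattern costs more passes rather than more hardware.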