Axon: A novel systolic array architecture for improved run time and energy efficient GeMM and Conv operation with on-chip im2col

Md Mizanur Rahaman Nayan, Ritik Raj, Gouse Basha Shaik, Tushar Krishna, Azad J Naeemi

TL;DR

Axon introduces an in-array data orchestration for systolic arrays that feeds operands through the principal diagonal and propagates them bidirectionally, achieving run-time reductions of up to 2× under all dataflows (output-, input-, and weight-stationary; OS/IS/WS) and workloads. It couples this with a lightweight im2col hardware unit built from 2-to-1 multiplexers that exploits input feature map reuse, reducing off-chip memory traffic at minimal area and power overhead. The architecture includes a unified PE design and zero-gating to leverage sparsity, and it demonstrates performance and energy benefits on GeMM and convolution workloads, including YOLOv3 and ResNet-50, with on-chip memory traffic reductions exceeding 60%. ASIC-level synthesis of 16×16 and larger configurations shows competitive area and power, with average speedups of around 1.5–1.8× across workloads and favorable comparisons to CMSA and Sauria in utilization and energy efficiency. Overall, Axon offers a practical, scalable path to faster, more energy-efficient GeMM and Conv execution through simple on-chip data orchestration and lightweight convolution lowering.

Abstract

General matrix multiplication (GeMM) is a core operation in virtually all AI applications. Systolic array (SA) based architectures have shown great promise as GeMM hardware accelerators thanks to their speed and energy efficiency. Unfortunately, SAs incur a linear delay in filling the operands because of unidirectional propagation via pipeline latches. In this work, we propose a novel in-array data orchestration technique for SAs in which data is fed on the principal diagonal and then propagates bidirectionally. This improves the runtime by up to 2× at minimal hardware overhead. In addition, the proposed data orchestration enables convolution lowering (known as im2col) with simple hardware support, fully exploiting input feature map reuse and significantly lowering off-chip memory traffic. This results in a 1.2× throughput improvement and a 2.17× inference energy reduction on average across the YOLOv3 and ResNet-50 workloads. In contrast, conventional data orchestration would require more elaborate hardware and control signals to implement im2col in hardware because of the data skew. We synthesized and placed and routed 16×16 systolic arrays based on the novel and conventional orchestrations using the ASAP 7nm PDK and found that the proposed approach incurs 0.211% area and 1.6% power overheads.
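
The run-time claim can be illustrated with a small timing model. The sketch below is an illustration under stated assumptions, not the authors' simulator: it assumes an N×N array in which an operand advances one PE per cycle, so that a conventional array (fed from the left and top edges with skew) delivers the first operand pair to PE(i, j) at cycle i + j, whereas diagonal feeding with bidirectional propagation delivers it at cycle |i − j|. All function names are hypothetical.

```python
def conventional_arrival(i: int, j: int) -> int:
    """Assumed first cycle at which PE(i, j) holds both operands (skewed edge feeding)."""
    return i + j


def diagonal_arrival(i: int, j: int) -> int:
    """Assumed first cycle at which PE(i, j) holds both operands (diagonal feeding)."""
    return abs(i - j)


def fill_latency(n: int, arrival) -> int:
    """Worst-case first-arrival cycle over an n x n array."""
    return max(arrival(i, j) for i in range(n) for j in range(n))


if __name__ == "__main__":
    for n in (4, 16, 128):
        conv = fill_latency(n, conventional_arrival)
        diag = fill_latency(n, diagonal_arrival)
        print(f"N={n}: conventional fill={conv} cycles, diagonal fill={diag} cycles")
```

Under this model the fill latency drops from 2(N − 1) to N − 1 cycles, which is consistent with the "up to 2×" bound when fill time dominates; the end-to-end gain would be smaller when the streamed dimension is long relative to N.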

Paper Structure (21 sections, 3 equations, 15 figures, 3 tables)

Figures (15)

  • Figure 1: Data feeder from the buffer and in-array dataflow in a conventional systolic array, along with the PE architecture
  • Figure 2: a) Scale up (left) b) Scale out (right). In scale up, one large monolithic array is used, whereas in scale out, multiple systolic arrays are used to generate the output.
  • Figure 3: a) Axon's in-array data orchestration. Thick semitransparent arrows indicate data movement into the systolic array from the buffers, and thin solid arrows indicate data movement inside the array among the PEs. PEs with the same color receive their operands in the same cycle, and PEs on the principal diagonal receive their operands in the first cycle directly from the buffers. b) im2col implementation. Note that each MUX allows a feeder PE to receive data either from the buffer or from the adjacent PE on the diagonal.
  • Figure 4: Simple $3\times3$ GeMM example that validates Axon data orchestration. Partial products labeled with the same colors indicate that they are generated in the same cycle. All operands are fed to SA through PE on the principal diagonal.
  • Figure 5: Axon data orchestration in a rectangular systolic array. Columns that do not have any PE on the principal diagonal are fed through the PEs at the bottom of the array, with zero padding based on their distance from the diagonal. Note that the third column is fed with one zero of padding and the fourth column with two.
  • ...and 10 more figures
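
The kind of check described for Figure 4 can be reproduced in software. The following is a minimal cycle-level sketch, assuming an output-stationary dataflow on a square n×n array in which every operand advances one PE per cycle away from the diagonal; it is not the authors' RTL or simulator, and the function names are hypothetical.

```python
def matmul(a, b):
    """Reference product of two matrices given as lists of lists."""
    k, m = len(b), len(b[0])
    return [[sum(row[x] * b[x][j] for x in range(k)) for j in range(m)]
            for row in a]


def simulate_diagonal_os(a, b):
    """Simulate diagonal feeding on an n x n output-stationary array.

    a is n x k and b is k x n. Returns (accumulators, cycles_used).
    """
    n, k = len(a), len(a[0])
    a_reg = [[0] * n for _ in range(n)]  # operand from A held in each PE
    b_reg = [[0] * n for _ in range(n)]  # operand from B held in each PE
    acc = [[0] * n for _ in range(n)]    # stationary partial sums
    cycles = k + n - 1                   # k streamed values plus n - 1 hops
    for t in range(cycles):
        new_a = [[0] * n for _ in range(n)]
        new_b = [[0] * n for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i == j:
                    # Diagonal PEs are fed directly (zeros once the stream ends).
                    new_a[i][j] = a[i][t] if t < k else 0
                    new_b[i][j] = b[t][j] if t < k else 0
                else:
                    # A values travel along row i, away from the diagonal.
                    new_a[i][j] = a_reg[i][j - 1] if j > i else a_reg[i][j + 1]
                    # B values travel along column j, away from the diagonal.
                    new_b[i][j] = b_reg[i - 1][j] if i > j else b_reg[i + 1][j]
        a_reg, b_reg = new_a, new_b
        for i in range(n):
            for j in range(n):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc, cycles


if __name__ == "__main__":
    a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    b = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]
    result, cycles = simulate_diagonal_os(a, b)
    print(result == matmul(a, b), cycles)
```

In this model both operands reach PE(i, j) with the same delay |i − j|, so matching pairs always meet and no input skew is needed. The accumulation finishes in k + n − 1 cycles, compared with k + 2n − 2 for skewed edge feeding under the same one-hop-per-cycle assumption.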