MACO: Exploring GEMM Acceleration on a Loosely-Coupled Multi-core Processor

Bingcai Sui; Junzhong Shen; Caixia Sun; Junhui Wang; Zhong Zheng; Wei Guo

TL;DR

MACO introduces a loosely-coupled multi-core architecture with per-core MMAEs to accelerate GEMM workloads, backed by a tile-based Matrix Processing Assist ISA and hardware support for data prefetching, locking, and predictive address translation. The design uses multi-level tiling, DMA-driven prefetch, and MTQ/STQ-based multi-process management to achieve high parallelism, reporting up to 1.1 TFLOPS with 88% efficiency and about 90% per-core efficiency across several cores. Evaluations demonstrate favorable area/power metrics, scalable performance across 2–16 compute nodes, and competitive gains over state-of-the-art baselines and accelerators on DL benchmarks. These results indicate MACO’s potential to flexibly and efficiently handle large-scale GEMM workloads and GEMM+ integrations in future processor designs.

Abstract

General-purpose processor vendors have integrated customized accelerators into their products due to the widespread use of General Matrix-Matrix Multiplication (GEMM) kernels. However, it remains a challenge to further improve the flexibility and scalability of these GEMM-enhanced processors to cater to emerging large-scale GEMM workloads. In this paper, we propose MACO, a novel loosely-coupled multi-core general-purpose architecture optimized for GEMM-related applications. To enhance the programmability and flexibility of MACO, the paper introduces a tile-based instruction set architecture. Additionally, the paper presents techniques such as hardware-assisted data prefetching and locking, and predictive address translation, to further enhance the computational efficiency of MACO for GEMM workloads. The experimental results demonstrate that MACO exhibits good scalability, achieving an average computational efficiency of 90% across multiple cores. Furthermore, evaluations on state-of-the-art deep neural networks show that MACO can achieve up to 1.1 TFLOPS with 88% computational efficiency, indicating its adaptivity to deep learning workloads.

Paper Structure (20 sections, 8 figures, 4 tables)

This paper contains 20 sections, 8 figures, 4 tables.

Introduction
Background
GEMM-enhanced CPUs
Mapping Tile GEMM algorithm on Systolic Arrays
Architectural Design of MACO
Overview
Matrix Processing Assist Instruction Set
Multi-process Management
Implementation Details of MACO
Predictive Address Translation
Mapping Real-world GEMM$^+$ Workloads on MACO
Evaluations
Experimental Settings
Experimental Results
Evaluations on Area and Power
...and 5 more sections

Figures (8)

Figure 1: Illustration of mapping tile GEMM algorithm on systolic array.
Figure 2: Overview of MACO architecture.
Figure 3: State transition diagram of an MTQ entry.
Figure 4: Basics of page table address prediction.
Figure 5: Mapping schemes of GEMM$^+$ workloads on MACO.
...and 3 more figures
