GTA: a new General Tensor Accelerator with Better Area Efficiency and Data Reuse

Chenyang Ai; Lechuan Zhao; Zhijie Huang; Cangyuan Li; Xinan Wang; Ying Wang

TL;DR

The paper tackles the inefficiency of existing general accelerators for multi-precision tensor operators. It introduces GTA, a General Tensor Accelerator that fuses a Multi-Precision Reconfigurable Array (MPRA) with a vector processing unit. Tensor operators are classified as either p-GEMM or vector operators, and a scheduling space spanning dataflow (weight-, input-, or output-stationary: WS/IS/OS), precision, and array resize maps workloads described by $M$, $N$, and $K$ onto the hardware. Key contributions include the MPRA design, the GTA architecture, the scheduling framework, and an evaluation against VPU, GPGPU, and CGRA baselines that reports 7.76x, 5.35x, and 8.76x memory efficiency and 6.45x, 3.39x, and 25.83x speedup, respectively. By jointly optimizing data reuse, precision support, and hardware configurability, the design targets domains such as ML, signal processing, and scientific computing.
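To make the idea of a scheduling space concrete, the following is a minimal sketch that enumerates (dataflow, array shape) pairs for a GEMM of size $M \times K$ by $K \times N$ and picks the pair with the highest processing-element utilization. The function names, the power-of-two resize options, the dataflow-to-dimension convention, and the utilization metric are all assumptions made for illustration; this summary does not specify GTA's actual cost model or search procedure.

```python
from itertools import product
from math import ceil

DATAFLOWS = ("WS", "IS", "OS")

def array_shapes(total_pes):
    """Hypothetical resize options: rows x cols with a power-of-two row count."""
    shapes = []
    rows = 1
    while rows <= total_pes:
        if total_pes % rows == 0:
            shapes.append((rows, total_pes // rows))
        rows *= 2
    return shapes

def spatial_dims(dataflow, M, N, K):
    """Assumed convention for C[M,N] = A[M,K] @ B[K,N]:
    the stationary operand is laid out across the array."""
    if dataflow == "WS":   # weights B (K x N) stay in place, M streams through
        return K, N
    if dataflow == "IS":   # inputs A (K x M) stay in place, N streams through
        return K, M
    if dataflow == "OS":   # outputs C (M x N) stay in place, K streams through
        return M, N
    raise ValueError(f"unknown dataflow: {dataflow}")

def utilization(dataflow, shape, M, N, K):
    """Fraction of PEs doing useful work, averaged over all spatial tiles."""
    rows, cols = shape
    r, c = spatial_dims(dataflow, M, N, K)
    tiles = ceil(r / rows) * ceil(c / cols)
    return (r * c) / (tiles * rows * cols)

def best_schedule(M, N, K, total_pes=64):
    """Exhaustively search the toy scheduling space; ties keep the first hit."""
    candidates = product(DATAFLOWS, array_shapes(total_pes))
    return max(candidates, key=lambda cand: utilization(cand[0], cand[1], M, N, K))

if __name__ == "__main__":
    print(best_schedule(M=4, N=16, K=4))
```

For a skinny workload such as $M=4$, $N=16$, $K=4$, a square $8\times 8$ array leaves half of its PEs idle under this toy metric, while a $4\times 16$ shape fills the array, which is the kind of mismatch array resize is meant to address.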

Abstract

Recently, tensor algebra has found significant applications across various domains. Each tensor operator has a different computational workload and precision requirement. However, current general accelerators, such as VPU, GPGPU, and CGRA, support tensor operators with low energy and area efficiency. This paper conducts an in-depth exploration of general accelerators for tensor processing. First, we identify the similarity between matrix multiplication and precision multiplication, and devise a method for classifying tensor operators. Building on these two findings, we introduce the systolic architecture into a general-purpose accelerator and propose a new General Tensor Accelerator (GTA) with better area efficiency and data reuse. Furthermore, we create a large hardware scheduling space consisting of dataflow, precision, and array resize. Our evaluation results demonstrate that GTA achieves 7.76X, 5.35X, and 8.76X memory efficiency and 6.45X, 3.39X, and 25.83X speedup over VPU, GPGPU, and CGRA, respectively.
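One common way to see the similarity between matrix multiplication and multi-precision multiplication is to split each wide operand into narrow limbs: the partial products then form an outer product of the two limb vectors, and the final result is a shifted sum along its anti-diagonals. The sketch below illustrates this general idea for unsigned integers only. The helper names and the 8-bit limb width are assumptions for illustration, and this is not a description of the MPRA hardware itself.

```python
def split_limbs(x, limb_bits, n_limbs):
    """Split an unsigned integer into n_limbs limbs, least significant first."""
    mask = (1 << limb_bits) - 1
    return [(x >> (i * limb_bits)) & mask for i in range(n_limbs)]

def multiply_via_limbs(a, b, limb_bits=8, n_limbs=2):
    """Multiply two unsigned integers using only limb-sized multiplications.

    The n_limbs x n_limbs grid of partial products has the same structure as
    an outer product, so it maps naturally onto a grid of narrow multipliers.
    """
    a_limbs = split_limbs(a, limb_bits, n_limbs)
    b_limbs = split_limbs(b, limb_bits, n_limbs)

    # Outer product of limbs: partial[i][j] = a_i * b_j
    partial = [[ai * bj for bj in b_limbs] for ai in a_limbs]

    # Each partial product carries weight 2^((i + j) * limb_bits)
    result = 0
    for i, row in enumerate(partial):
        for j, p in enumerate(row):
            result += p << ((i + j) * limb_bits)
    return result

if __name__ == "__main__":
    a, b = 0x1234, 0xABCD
    assert multiply_via_limbs(a, b) == a * b
```

With `n_limbs=4` the same routine covers 32-bit operands using sixteen 8-bit multiplications, which is why one array of narrow multipliers can in principle serve several precisions.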

Paper Structure (21 sections, 10 figures, 3 tables)

This paper contains 21 sections, 10 figures, 3 tables.

Introduction
BACKGROUND
Specialized Accelerators and Systolic Array
Specialized Accelerators Become More and More General
INSIGHTS INTO MULTI-PRECISION TENSOR OPERATORS
Similarity between Matrix Multiplication and Precision Multiplication
Classification of Tensor Operators using p-GEMM and vector operators
HARDWARE ARCHITECTURE
Multi-Precision Reconfigurable Array
GTA Overall Architecture
SCHEDULING SPACE EXPLORING FOR p-GEMM OPERATOR
METHODOLOGY
Implementation
Workload
Baseline
...and 6 more sections

Figures (10)

Figure 1: Diagram of multi-precision matrix multiplication implemented on the systolic array.
Figure 2: The indicative algorithmic parallelism and arithmetic intensity of some tensor operators.
Figure 3: Implementation of a 16-bit multi-precision accumulator.
Figure 4: Architecture overview: example of 16 lanes. (a) One row of the MPRA covers various precisions in WS mode. (b) An $8\times 8$ MPRA performs multi-precision p-GEMM and vector operations. (c) Overview of the GTA architecture. (d) The MPRA combines the whole array with a reconfigurable shape. (e) The modified slide unit uses mask bits to arrange the dataflow of the array.
Figure 5: Dataflow pattern matching: 64 lanes and $64\times64$ size array example.
...and 5 more figures
