Exploring the Performance Improvement of Tensor Processing Engines through Transformation in the Bit-weight Dimension of MACs

Qizhe Wu, Huawen Liang, Yuchen Gui, Zhichen Zeng, Zerong He, Linfeng Tao, Xiaotian Wang, Letian Zhao, Zhaoxi Zeng, Wei Yuan, Wei Wu, Xi Jin

TL;DR

The paper addresses the performance bottlenecks of tensor processing engines by focusing on the bit-weight dimension (BW) of MACs. It introduces a compute-centric BW-based notation and four orthogonal optimizations (OPT1–OPT4) to uncover and exploit BW-driven parallelism and sparsity in partial products. RTL-based evaluation at 28nm across four classic TPE architectures shows consistent area and energy improvements, with substantial gains in bit-slice benchmarks (e.g., up to 12.10× energy efficiency vs Laconic). The work demonstrates that encoding-based sparsity and selective reduction can dramatically boost density and frequency, offering a practical path to higher-throughput, energy-efficient AI accelerators for DNNs and LLMs. The provided Verilog code and hardware reports support reproducibility and future exploration of BW-aware TPE design.

Abstract

General matrix-matrix multiplication (GEMM) is a cornerstone of AI computations, making tensor processing engines (TPEs) increasingly critical in GPUs and domain-specific architectures. Existing architectures primarily optimize dataflow or operand reuse strategies. However, considering the interaction between matrix multiplication and multiply-accumulators (MACs) offers greater optimization potential. This work introduces a novel hardware perspective on matrix multiplication, focusing on the bit-weight dimension of MACs. We propose a finer-grained TPE notation using matrix triple loops as an example, introducing new methods for designing and optimizing PE microarchitectures. Based on this notation and its transformations, we propose four optimization techniques that improve timing, area, and power consumption. Implementing our design in RTL using the SMIC-28nm process, we evaluate its effectiveness across four classic TPE architectures: systolic array, 3D-Cube, multiplier-adder tree, and 2D-Matrix. Our techniques achieve area efficiency improvements of 1.27x, 1.28x, 1.56x, and 1.44x, and energy efficiency gains of 1.04x, 1.56x, 1.49x, and 1.20x, respectively. Applied to a bit-slice architecture, our approach achieves a 12.10x improvement in energy efficiency and 2.85x in area efficiency compared to Laconic. Our Verilog HDL code, along with timing, area, and power reports, is available at https://github.com/wqzustc/High-Performance-Tensor-Processing-Engines
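To make the idea of a bit-weight dimension concrete, the following minimal sketch adds a fourth loop over the bit positions of one operand to the usual GEMM triple loop. It is an illustration only, not the authors' notation or RTL: it assumes unsigned integer operands, plain bit decomposition rather than any particular encoding, and the function names `gemm_reference` and `gemm_bit_weight` are hypothetical. It shows that partial products of the same bit weight can be accumulated across the reduction dimension before a single shift is applied.

```python
def gemm_reference(a, b):
    """Plain triple-loop GEMM: c[i][j] = sum over p of a[i][p] * b[p][j]."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            for p in range(k):
                c[i][j] += a[i][p] * b[p][j]
    return c


def gemm_bit_weight(a, b, bits=8):
    """Same GEMM with an explicit loop over the bit weights of b.

    Assumption: every entry of b is an unsigned integer below 2**bits.
    For each bit weight w, the partial products of that weight are summed
    over the reduction dimension first, then shifted once by w.
    """
    m, k, n = len(a), len(b), len(b[0])
    c = [[0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            for w in range(bits):          # bit-weight dimension
                same_weight_sum = 0
                for p in range(k):         # reduction dimension
                    if (b[p][j] >> w) & 1:  # zero bits contribute nothing
                        same_weight_sum += a[i][p]
                c[i][j] += same_weight_sum << w
    return c
```

Calling `gemm_bit_weight(a, b)` on small unsigned matrices gives the same result as `gemm_reference(a, b)`. The reordering only changes where the shift happens, which is the kind of freedom a loop transformation in this dimension exposes.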

Paper Structure

This paper contains 28 sections, 8 equations, 14 figures, 7 tables.

Figures (14)

  • Figure 1: The microarchitecture of INT MAC and MM unit.
  • Figure 2: Improvements in microarchitecture compared to other works. (A) Traditional MAC (TPU-Like). (B) and (C) Bit-serial-based computation methods. (D) Optimized MAC. (E) and (F) Optimized bit-serial architectures. (G) Similarities and differences with floating-point optimized schemes. The DFFs are not shown; only Step ❸ includes a pipeline register, while the other steps are single-cycle operations.
  • Figure 3: Example of multiplication based on encoding.
  • Figure 4: The GEMM loop from the PE microarchitecture perspective.
  • Figure 5: The proposed optimization architecture 1 (OPT1).
  • ...and 9 more figures