tubGEMM: Energy-Efficient and Sparsity-Effective Temporal-Unary-Binary Based Matrix Multiply Unit

Prabhu Vellaisamy, Harideep Nair, Joseph Finn, Manav Trivedi, Albert Chen, Anna Li, Tsung-Han Lin, Perry Wang, Shawn Blanton, John Paul Shen

TL;DR

tubGEMM introduces an exact GEMM unit that uses hybrid temporal-unary and binary encoding (twos-Unary) to achieve substantial area, power, and energy savings over prior unary designs, while leveraging dynamic value sparsity for runtime latency and energy reductions. The architecture features a structured PE array, a tailored dataflow with index counters and vector generators, and a temporal-unary encoder that preserves precision. Hardware results across 45nm and 5nm CMOS show strong reductions relative to uGEMM, with scalable performance as matrix size and bit-width vary, and real DNN workloads (MobileNetv2, ResNet-50) demonstrating significant energy and EDP improvements. Overall, tubGEMM offers a practical path to energy-efficient, exact GEMM at edge scales, enabling low-power inference and potential online learning on resource-constrained devices.

Abstract

General Matrix Multiplication (GEMM) is a ubiquitous compute kernel in deep learning (DL). To support energy-efficient edge-native processing, new GEMM hardware units have been proposed that operate on unary encoded bitstreams using much simpler hardware. Most unary approaches thus far focus on rate-based unary encoding of values and perform stochastic approximate computation. This work presents tubGEMM, a novel matrix-multiply unit design that employs hybrid temporal-unary and binary (tub) encoding and performs exact (not approximate) GEMM. It intrinsically exploits dynamic value sparsity to improve energy efficiency. Compared to the current best unary design uGEMM, tubGEMM significantly reduces area, power, and energy by 89%, 87%, and 50%, respectively. A tubGEMM design performing 128x128 matrix multiply on 8-bit integers, in commercial TSMC N5 (5nm) process node, consumes just 0.22 mm² die area, 417.72 mW power, and 8.86 µJ energy, assuming no sparsity. Typical sparsity in DL workloads (MobileNetv2, ResNet-50) reduces energy by more than 3x, and lowering precision to 4 and 2 bits further reduces it by 24x and 104x respectively.
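
To make the idea of hybrid temporal-unary and binary multiplication concrete, the following is a minimal software sketch, not the paper's hardware design. It assumes unsigned operands, a temporal-unary stream in which a value `v` appears as `v` leading ones within a window of `2**n_bits - 1` cycles, and a multiplier that adds the binary operand to an accumulator once per active cycle. The function names (`temporal_unary`, `tub_multiply`, `tub_gemm`) and the cycle count used as an activity proxy are hypothetical choices made for illustration.

```python
def temporal_unary(value, n_bits):
    """Encode `value` as a pulse train: `value` ones, then zeros.

    Assumption: the stream window is 2**n_bits - 1 cycles long and the
    ones come first; the actual hardware convention may differ.
    """
    length = (1 << n_bits) - 1
    if not 0 <= value <= length:
        raise ValueError("value does not fit in n_bits (unsigned)")
    return [1] * value + [0] * (length - value)


def tub_multiply(a, b, n_bits):
    """Multiply a (streamed as temporal-unary) by b (held in binary).

    Returns (product, active_cycles). The product is exact because b is
    added exactly `a` times; no stochastic approximation is involved.
    """
    acc = 0
    active_cycles = 0
    for bit in temporal_unary(a, n_bits):
        if bit == 0:
            break  # pulse has ended; nothing more to accumulate
        acc += b
        active_cycles += 1
    return acc, active_cycles


def tub_gemm(A, B, n_bits):
    """Exact matrix product C = A @ B built from tub_multiply.

    Also returns the total number of active cycles over all scalar
    multiplies, a rough proxy (an assumption of this sketch) for how
    small or zero values in A reduce switching activity.
    """
    rows, inner, cols = len(A), len(B), len(B[0])
    C = [[0] * cols for _ in range(rows)]
    total_active = 0
    for i in range(rows):
        for j in range(cols):
            for k in range(inner):
                product, cycles = tub_multiply(A[i][k], B[k][j], n_bits)
                C[i][j] += product
                total_active += cycles
    return C, total_active
```

In this sketch a zero element in `A` contributes no active cycles at all, and a small element contributes few, which is one way to picture why dynamic value sparsity can lower runtime activity. The mapping from active cycles to real latency or energy depends on the actual dataflow and is not modeled here.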

Paper Structure

This paper contains 17 sections, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Rate-Unary Encoding vs. Temporal-Unary Encoding vs. 3-Bit Binary Encoding for the value '5'
  • Figure 2: 4x4 tubGEMM Architectural Block Diagram
  • Figure 3: tubGEMM components
  • Figure 4: uGEMM vs. tubGEMM 45nm post-synthesis WC PPA and energy values for bipolar and unipolar non-scaled GEMM
  • Figure 5: tubGEMM TSMC N5 PPA and energy scaling, for input matrix dimensions of 16x16, 32x32, 64x64, and 128x128, and matrix element value bit-widths of 8 bits, 4 bits, and 2 bits
  • ...and 1 more figure