Commercial Evaluation of Zero-Skipping MAC Design for Bit Sparsity Exploitation in DL Inference

Harideep Nair, Prabhu Vellaisamy, Tsung-Han Lin, Perry Wang, Shawn Blanton, John Paul Shen

TL;DR

Problem: Efficient DL inference requires exploiting bit sparsity to reduce MAC cost. Method: OzMAC, an updated PRA-based MAC with an Oz-encoder, is evaluated in TSMC N5 across 4- to 16-bit precisions and 0.5 to 1.5 GHz clock frequencies; eight pretrained INT8 workloads are shown to exhibit high bit sparsity. Key findings: OzMAC achieves up to 30% area, 70% power, and 46% energy improvements relative to a conventional binary MAC, and frequency scaling can recover throughput: energy savings persist even when OzMAC's clock is scaled to match the binary MAC's throughput. Significance: This work demonstrates the commercial viability of zero-skipping MAC designs for DL accelerators, enabling smaller, more energy-efficient inference hardware, and identifies system-level validation as future work.
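The zero-skipping idea behind PRA and OzMAC can be illustrated in software. The Python sketch below is a simplified model, not the authors' design: it assumes unsigned 8-bit activations in plain binary, and `bit_sparsity`, `oz_encode`, and `zero_skip_dot` are hypothetical names. The encoder lists only the positions of the 1 bits, so the multiply-accumulate spends one shift-and-add per nonzero bit rather than one per bit position.

```python
from typing import Iterable, List, Tuple

BITS = 8  # assumed activation width: unsigned 8-bit magnitudes (simplification)


def bit_sparsity(values: Iterable[int], bits: int = BITS) -> float:
    """Fraction of zero bits over the `bits`-bit encodings of all values."""
    vals = list(values)
    if not vals:
        raise ValueError("need at least one value")
    for v in vals:
        if not 0 <= v < (1 << bits):
            raise ValueError(f"{v} does not fit in {bits} unsigned bits")
    ones = sum(bin(v).count("1") for v in vals)
    return 1.0 - ones / (len(vals) * bits)


def oz_encode(a: int) -> List[int]:
    """Hypothetical encoder: positions of the 1 bits in `a`; zero bits are omitted."""
    return [i for i in range(a.bit_length()) if (a >> i) & 1]


def zero_skip_dot(acts: List[int], weights: List[int]) -> Tuple[int, int]:
    """Shift-and-add dot product doing one add per nonzero activation bit.

    Returns (result, steps). `steps` models cycles in a design that handles
    one nonzero bit per cycle (an assumption, not OzMAC's actual schedule).
    """
    acc, steps = 0, 0
    for a, w in zip(acts, weights):
        for pos in oz_encode(a):
            acc += w << pos  # add the weight shifted by the position of a 1 bit
            steps += 1
    return acc, steps


acts = [0, 1, 2, 3, 128, 0, 5, 64]   # toy activations, not workload data
weights = [7, -2, 3, 4, 1, 9, -1, 2]  # signed weights are fine for shift-and-add
result, steps = zero_skip_dot(acts, weights)
assert result == sum(a * w for a, w in zip(acts, weights))
print(result, steps, len(acts) * BITS, bit_sparsity(acts))  # prints: 267 8 64 0.875
```

With 87.5% zero bits in this toy input, the zero-skipping loop performs 8 shift-and-add steps where a bit-serial design that visits every bit position would perform 64. A parallel binary multiplier instead spends fixed hardware on every product regardless of sparsity. Actual savings depend on the encoding, how many bits are processed per cycle, and synchronization across MAC lanes, none of which this sketch models.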

Abstract

General Matrix Multiply (GEMM) units, consisting of multiply-accumulate (MAC) arrays, perform the bulk of the computation in deep learning (DL). Recent work has proposed a novel MAC design, Bit-Pragmatic (PRA), capable of dynamically exploiting bit sparsity. This work presents OzMAC (Omit-zero-MAC), a modified re-implementation of PRA, and goes beyond earlier work by performing a rigorous post-synthesis evaluation against a conventional binary MAC design across multiple bitwidths and clock frequencies in the TSMC N5 process node to assess its potential for commercial implementation. We demonstrate the existence of high bit sparsity in eight pretrained INT8 DL workloads and show that the 8-bit OzMAC improves area, power, and energy by 21%, 70%, and 28%, respectively. Similar improvements are achieved across data precisions (4, 8, and 16 bits) and clock frequencies (0.5 GHz, 1 GHz, and 1.5 GHz). When the 8-bit OzMAC's clock frequency is scaled to match the throughput of the binary MAC, it still achieves a 30% improvement in both power and energy.
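The relation between the power and energy figures above can be read through a first-order model. The symbols here are introduced for illustration and are not taken from the paper; the model assumes energy is average power times the execution time of the same workload, and execution time is cycle count divided by clock frequency.

```latex
\begin{aligned}
E &= P\,t, \qquad t = \frac{N_{\text{cyc}}}{f},\\
\frac{E_{\text{Oz}}}{E_{\text{bin}}}
  &= \frac{P_{\text{Oz}}}{P_{\text{bin}}}\cdot\frac{t_{\text{Oz}}}{t_{\text{bin}}}
   = \frac{P_{\text{Oz}}}{P_{\text{bin}}}\cdot\frac{N_{\text{Oz}}/f_{\text{Oz}}}{N_{\text{bin}}/f_{\text{bin}}}.
\end{aligned}
```

When OzMAC's frequency is chosen so that its execution time matches the binary MAC's, the time ratio is 1 and the energy ratio equals the power ratio, which is consistent with the reported 30% saving in both metrics. At a common clock, a zero-skipping MAC presumably needs more cycles per product than a single-cycle binary multiplier, so the time ratio exceeds 1 and the energy saving is smaller than the power saving, as in the 8-bit figures (70% power, 28% energy).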

Paper Structure (9 sections, 3 figures, 5 tables)

Figures (3)

  • Figure 1: OzMAC (based on PRA [albericio2017bit]) with an example computation.
  • Figure 2: Energy consumption vs. bit sparsity (%). The green-shaded region marks the bit-sparsity range of the workloads listed in the DNN sparsity table (tab:dnn_sparsity).
  • Figure 3: Die area and power costs vs. precision configuration.