Tempus Core: Area-Power Efficient Temporal-Unary Convolution Core for Low-Precision Edge DLAs

Prabhu Vellaisamy, Harideep Nair, Thomas Kang, Yichen Ni, Haoyang Fan, Bin Qi, Jeff Chen, Shawn Blanton, John Paul Shen

TL;DR

Tempus Core is a temporal-unary-binary (tub) convolution engine that can be dropped into NVIDIA's open-source NVDLA, delivering major gains in area and power efficiency while preserving dataflow compatibility. The approach uses tub multipliers and 2s-unary encoding to exploit value sparsity and multi-cycle compute, achieving iso-area throughput improvements of up to $5\times$ (INT8) and $4\times$ (INT4) for a $16\times16$ PE array. Across post-synthesis and place-and-route analyses, Tempus Core reduces area by $\sim59\%$ and power by $\sim15\%$ at the PCU level, with even larger gains at the array scale, and a workload-aware latency analysis shows energy benefits in sparse-weight regimes. These results indicate that unary-based convolution units can be integrated into existing DLAs to enable significantly more area- and power-efficient edge AI inference.

Abstract

The increasing complexity of deep neural networks (DNNs) poses significant challenges for edge inference deployment due to the resource and power constraints of edge devices. Recent works on unary-based matrix multiplication hardware aim to leverage data sparsity and low-precision values to enhance hardware efficiency. However, the adoption and integration of such unary hardware into commercial deep learning accelerators (DLAs) remain limited due to processing element (PE) array dataflow differences. This work presents Tempus Core, a convolution core with a highly scalable unary-based PE array composed of tub (temporal-unary-binary) multipliers that integrates seamlessly with NVDLA (NVIDIA's open-source DLA for accelerating CNNs) while maintaining dataflow compliance and boosting hardware efficiency. Analysis across various datapath granularities shows that for INT8 precision in 45nm CMOS, Tempus Core's PE cell unit (PCU) yields 59.3% and 15.3% reductions in area and power consumption, respectively, over NVDLA's CMAC unit. For a 16x16 PE array in Tempus Core, area and power improve by 75% and 62%, respectively, while delivering 5x and 4x iso-area throughput improvements for INT8 and INT4 precisions. Post-place-and-route analysis of Tempus Core's PCU shows that the 16x4 PE array for INT4 precision in 45nm CMOS requires only 0.017 mm^2 of die area and consumes only 6.2mW of total power. We demonstrate that area-power efficient unary-based hardware can be seamlessly integrated into conventional DLAs, paving the path for efficient unary hardware for edge AI inference.

Paper Structure

This paper contains 13 sections, 1 equation, 9 figures, 3 tables.

Figures (9)

  • Figure 1: Quantization training accuracies achieved on different ImageNet CNNs for different integer-based precisions when compared to baseline FP32 precision [jain2020trained]. Results show minimal accuracy decrease with lower precisions.
  • Figure 2: An example dataflow of an INT4 tub multiplier. The INT4 tub multiplier takes a 4-bit binary-encoded value and a single temporal-coded bitstream as inputs. For each "1" bit in the bit-serial temporal-coded input, the binary value is accumulated, producing the desired output result.
  • Figure 3: Overview of Tempus Core integration into NVDLA. In NVDLA, the convolution buffer (CB) stores both activation and weight values, which are fed into the Convolution Core (CC) consisting of the convolution sequence controller (CSC), the Convolution MAC (CMAC) unit (containing the $k \times n$ MAC array, output registers, and handshaking logic), and the convolution accumulator (CACC). In this work, CC is replaced by Tempus Core, which contains a modified CSC and a PE cell unit (PCU) containing $k$ tub-based PE cells replacing CMAC. Each PE cell consists of $n$ tub-based multipliers. Additional handshaking logic to facilitate multi-cycle convolution operation is also present, as well as output registers to maintain functionality.
  • Figure 4: Post-synthesis total power consumption (left) and cell area utilization (right) in 45nm CMOS for the two different $16 \times 16$ designs, both for INT4 and INT8 precisions.
  • Figure 5: Post-synthesis area utilization and total power consumption in 45nm CMOS across entire CMAC and PCU units for different array widths ($16 \times n$) with $n$ = 4, 16, and 32. CC refers to the CMAC unit in CC, and TC denotes the PCU inside Tempus Core. For a constant core configuration of 16 PE cells (array height), the number of multipliers is varied across INT8, INT4, and INT2 precisions. Percentage decreases in area and power consumption for INT8 are denoted by the red dotted arrows.
  • ...and 4 more figures
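
The tub multiplier behavior described in the Figure 2 caption can be illustrated with a small behavioral model. This is only a sketch under stated assumptions, not the paper's RTL: it uses plain unsigned unary streams (one "1" per unit of magnitude) rather than the 2s-unary encoding, ignores sign handling, and does not fix which operand (weight or activation) is temporal-coded. The function names `temporal_unary_stream`, `tub_multiply`, and `pe_cell_dot` are hypothetical, and the stream length of 15 assumes an unsigned 4-bit range.

```python
def temporal_unary_stream(value, length):
    """Encode a non-negative integer as a bit-serial stream:
    `value` ones followed by zeros (assumed plain unary, not 2s-unary)."""
    if not 0 <= value <= length:
        raise ValueError("value must fit in the stream length")
    return [1] * value + [0] * (length - value)


def tub_multiply(binary_operand, unary_stream):
    """Accumulate the binary operand once per '1' bit in the stream,
    one stream bit per cycle, as described for the tub multiplier."""
    acc = 0
    for bit in unary_stream:
        if bit:
            acc += binary_operand
    return acc


def pe_cell_dot(binary_operands, unary_operands, length):
    """Hypothetical PE cell: n tub multipliers advanced in lockstep,
    with their per-cycle contributions summed into one accumulator."""
    streams = [temporal_unary_stream(u, length) for u in unary_operands]
    acc = 0
    for cycle in range(length):
        for b, stream in zip(binary_operands, streams):
            if stream[cycle]:
                acc += b
    return acc


if __name__ == "__main__":
    stream = temporal_unary_stream(3, 15)   # unsigned 4-bit range assumed
    print(tub_multiply(5, stream))          # 5 accumulated three times
    print(pe_cell_dot([1, 2, 3], [4, 0, 2], 15))
```

In this model a zero-valued temporal operand contributes no accumulations at all, and the number of useful cycles is set by the largest magnitude in the stream set, which gives some intuition for why value sparsity and low precision matter for this style of multiplier.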