Hybrid Temporal Computing for Lower Power Hardware Accelerators

Maliha Tasnim, Sachin Sachdeva, Yibo Liu, Sheldon X.-D. Tan

TL;DR

The paper tackles the escalating energy demands of modern computing by introducing Hybrid Temporal Computing (HTC), a framework that fuses temporal (delay-based) and pulse-rate encodings to enable general-purpose, ultra-low-power hardware accelerators. HTC redefines multiplication through temporal-regulated bitstreams and addition via SC-like scaled operations, while propagating data temporally to minimize switching. The authors implement a 4-input HTC MAC and demonstrate HTC-based FIR and DCT/iDCT accelerators, achieving substantial gains over Unary and CBSC baselines in power and area, and in some cases near-CBSC accuracy with much lower energy. This work offers a practical path to energy-efficient DSP/AI accelerators by leveraging hybrid encoding and deterministic temporal processing, with potential impact on embedded systems and edge computing.

Abstract

In this paper, we propose a new hybrid temporal computing (HTC) framework that leverages both pulse rate and temporal data encoding to design ultra-low energy hardware accelerators. Our approach is inspired by the recently proposed temporal computing, or race logic, which encodes data values as single delays, leading to significantly lower energy consumption due to minimized signal switching. However, race logic is limited in its applications due to inherent restrictions. The new HTC framework overcomes these limitations by encoding signals in both temporal and pulse rate formats for multiplication and in temporal format for propagation. This approach maintains reduced switch energy while being general enough to implement a wide range of arithmetic operations. We demonstrate how HTC multiplication is performed for both unipolar and bipolar data encoding and present the basic designs for multipliers, adders, and MAC units. Additionally, we implement two hardware accelerators: a Finite Impulse Response (FIR) filter and a Discrete Cosine Transform (DCT)/iDCT engine for image compression and DSP applications. Experimental results show that the HTC MAC has a significantly smaller power and area footprint compared to the Unary MAC design and is orders of magnitude faster. Compared to the CBSC MAC, the HTC MAC reduces power consumption by $45.2\%$ and area footprint by $50.13\%$. For the FIR design, the HTC design significantly outperforms the Unary design on all metrics. Compared to the CBSC design, the HTC-based FIR filter reduces power consumption by $36.61\%$ and area cost by $45.85\%$. The HTC-based DCT filter retains the quality of the original image with a decent PSNR, while consuming $23.34\%$ less power and occupying $18.20\%$ less area than the CBSC MAC-based DCT filter.

Paper Structure (13 sections, 1 equation, 8 figures, 3 tables)

Figures (8)

  • Figure 1: Race logic example. X = 2 (a delay of 2 from the origin time), Y = 3, and X' = 4. O = min(X', Y) = 3 and A = max(X', Y) = 4. INHIBIT: if Y arrives earlier than X/X', then I = Y; otherwise I is unchanged [temporal_computing_nist].
  • Figure 2: (a) Traditional stochastic multiplication; (b) the counting-based stochastic computing (CBSC) multiplication [SimLee:DAC'17, Yu:DAC'21].
  • Figure 3: The regulated bitstream representation of (a) the 3-bit unipolar binary data 011 and (b) the 3-bit bipolar binary data 110. Here, $X_s$ is the sign bit.
  • Figure 4: Multiplication and addition in the HTC framework
  • Figure 5: The proposed HTC multiple-input MAC architecture
  • ...and 3 more figures
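
The min and max operations in the Figure 1 caption can be reproduced with a small simulation. The sketch below is illustrative only and is not taken from the paper: it assumes each value is encoded as the time step at which a wire rises from 0 to 1, and that an OR gate and an AND gate then produce the earliest and latest arrival, respectively. The helper names `edge` and `rise_time` are hypothetical.

```python
# Minimal race-logic sketch (assumed model, not the authors' implementation):
# a value is the time step at which a wire switches from 0 to 1.

def edge(arrival, t):
    """Logic level at time t of a wire that rises at time `arrival`."""
    return 1 if t >= arrival else 0

def rise_time(gate, arrivals, horizon=16):
    """First time step at which `gate` over the input wires outputs 1.

    `gate` is `any` (OR gate) or `all` (AND gate). Returns None if the
    output never rises within `horizon` time steps.
    """
    for t in range(horizon):
        if gate(edge(a, t) for a in arrivals):
            return t
    return None

# Values from the Figure 1 caption: X' = 4, Y = 3.
X_PRIME, Y = 4, 3

O = rise_time(any, (X_PRIME, Y))  # OR gate: rises with the first arrival (min)
A = rise_time(all, (X_PRIME, Y))  # AND gate: rises with the last arrival (max)

print(O, A)  # 3 4
```

Each wire switches only once per computation, which is the property the abstract credits for the low switching energy of temporal encoding.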