Efficient Nonlinear Function Approximation in Analog Resistive Crossbars for Recurrent Neural Networks

Junyi Yang, Ruibin Mao, Mingrui Jiang, Yichuan Cheng, Pao-Sheng Vincent Sun, Shuai Dong, Giacomo Pedretti, Xia Sheng, Jim Ignowski, Haoliang Li, Can Li, Arindam Basu

TL;DR

This work experimentally demonstrates a non-linear activation function integrated with a ramp analog-to-digital conversion at the periphery of the memory to improve in-memory implementation of recurrent neural networks.

Abstract

Analog In-memory Computing (IMC) has demonstrated energy-efficient and low-latency implementation of convolution and fully-connected layers in deep neural networks (DNNs) by using physics for computing in parallel resistive memory arrays. However, recurrent neural networks (RNNs), which are widely used for speech recognition and natural language processing, have seen limited success with this approach. This can be attributed to the significant time and energy penalties incurred in implementing the nonlinear activation functions that are abundant in such models. In this work, we experimentally demonstrate the implementation of a non-linear activation function integrated with a ramp analog-to-digital conversion (ADC) at the periphery of the memory to improve in-memory implementation of RNNs. Our approach uses an extra column of memristors to produce an appropriately pre-distorted ramp voltage such that the comparator output directly approximates the desired nonlinear function. We experimentally demonstrate programming different nonlinear functions using a memristive array and simulate its incorporation in RNNs to solve keyword spotting and language modelling tasks. Compared to other approaches, we demonstrate a manifold increase in area efficiency, energy efficiency and throughput due to the in-memory, programmable ramp generator that removes digital processing overhead.
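
The pre-distorted ramp idea can be illustrated numerically. The following is a minimal, idealized sketch and not the authors' circuit: it assumes a sigmoid target, 5-bit resolution, noise-free ramp thresholds placed at the inverse sigmoid of uniformly spaced output levels, and mid-bin decoding of the counter value. The names `make_ramp`, `nl_adc` and `decode` are hypothetical.

```python
import numpy as np

def make_ramp(bits=5):
    """Ramp thresholds for a sigmoid NL-ADC (idealized assumption).

    Thresholds sit at logit(k / 2**bits) for k = 1 .. 2**bits - 1, so that
    uniformly spaced output codes correspond to non-uniform input voltages.
    """
    n = 2 ** bits
    levels = np.arange(1, n) / n           # uniformly spaced output levels in (0, 1)
    return np.log(levels / (1.0 - levels))  # inverse sigmoid (logit)

def nl_adc(x, ramp):
    """Counter value: number of ramp thresholds that are <= the input x."""
    return np.searchsorted(ramp, x, side="right")

def decode(code, bits=5):
    """Map a counter value to the midpoint of its output bin."""
    return (code + 0.5) / 2 ** bits

if __name__ == "__main__":
    ramp = make_ramp(bits=5)
    x = np.linspace(-6.0, 6.0, 1001)        # stand-in for the analog dot product
    approx = decode(nl_adc(x, ramp), bits=5)
    exact = 1.0 / (1.0 + np.exp(-x))
    print("max abs error:", np.abs(approx - exact).max())
```

Under these assumptions the approximation error is bounded by half an output step (1/64 for 5 bits), since every input that falls between two adjacent thresholds has a sigmoid value inside the corresponding output bin.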

Paper Structure

This paper contains 14 sections, 18 equations, 20 figures, 19 tables, 1 algorithm.

Figures (20)

  • Figure 1: Limitation of current in-memory computing (IMC) for recurrent neural networks and our proposed solution. a A survey of DNN accelerators shows the improvement in energy efficiency offered by IMC over digital architectures. However, the improvement does not extend to recurrent neural networks (RNNs) such as LSTM, and a gap in energy efficiency remains between RNNs and feedforward architectures. Details of the surveyed papers are available here [survey]. b Architecture of an LSTM cell showing a large number of nonlinear (NL) activations such as sigmoid and hyperbolic tangent, which are absent in feedforward architectures that mostly use simple nonlinearities like the rectified linear unit (ReLU). c Digital implementation of the NL operations causes a bottleneck in latency and energy efficiency, since the linear operations are highly efficient in time and energy usage due to the inherent parallelism of IMC. For an LSTM layer with $512$ hidden units and $k=32$ parallel digital processors for the NL operations, the NL operations still take $2$-$5\times$ longer to execute because each NL activation needs multiple clock cycles ($N_{cyc}$). d Our proposed solution creates an in-memory analog-to-digital converter (ADC) that combines NL activation with digitization of the dot product between input and weight vectors.
  • Figure 2: Overview of in-memory nonlinear ADC. a The concept of a traditional ramp-based ADC. b The schematic and timing of in-memory computing circuits with embedded nonlinear activation function generation. c The inverse of the sigmoid function illustrates the shape of the required ramp voltage. d The value of each step of the ramp voltage $V_{ramp}$, denoted by $\Delta V_k$, is proportional to the memristor conductance $G_{adc,k}$ used to program the nonlinear ramp voltage. The desired conductances for a 5-bit implementation of a sigmoid nonlinear activation are shown. e Comparison of the number of cells used by 5-bit and 4-bit in-SRAM and 5-bit in-RRAM nonlinear functions. The RRAM-based nonlinear function has an approximation error between those of the two SRAM-based ones due to write noise, while occupying a smaller area due to its compact size.
  • Figure 3: Experimentally demonstrated NL-ADC on crossbar arrays. a Calibration process for accurate NL-ADC programming. The left panel shows the ramp function for the ideal case, for programming without bias calibration, and for programming with bias calibration. The case with bias calibration shows better INL performance. The right panel shows the actual conductance mapping on the crossbar arrays for two blocks of $8$ arbitrarily selected columns. The lower $5$ conductances are for bias calibration while the top $32$ are for the ramp generation. We show cases where the mapped NL-ADC weights have no stuck-at-OFF devices and low programming error (left block), and cases with stuck-at-OFF devices and high programming error (right block). The results show that both cases can be calibrated by the additional $5$ memristors. b Robustness of our proposed in-memory NL-ADC under $V_{\text{read}}$ variations. We sweep $V_{\text{read}}$ from 0.15 V to 0.25 V to simulate noise-induced variations in read voltage. A normal ADC shows large variations, while our in-memory NL-ADC tracks $V_{\text{read}}$.
  • Figure 4: LSTM for KWS task. a Architecture of the LSTM network for on-chip inference. b Mapping of the LSTM network onto the chip. Weights and nonlinearities (sigmoid and tanh) of the LSTM layer are programmed onto crossbar arrays as conductances. Input and output (I/O) data of the LSTM layer are sent from/to the integrated chip through off-chip circuits. c Weight conductance distribution curve and error. d The measured inference accuracy results obtained on the chip are compared with the software baseline using the ideal model, as well as simulation results under different bit NL-ADC models and hardware-measured weight noise. e Energy efficiency and area efficiency comparison: our LSTM IC, conventional ADC model and recently published LSTM ICs from research papers [yue20197, kadetotad20208, shin201714, yin20171, conti2018chipmunk, natIBM64core, Nature2023analogAIchipIBM, jouppi2017datacenter]. Energy efficiency and throughput under 8 bit,
  • Figure 5: LSTM for NLP task. a Architecture of the LSTM network for on-chip inference in the character prediction task. b Comparison in the LSTM layer between the number of neurons and operations per timestep in the NLP model for character prediction and the KWS model. c Simulation results under different bit resolutions of NL-ADC models and hardware-measured weight noise compared with the software baseline using the ideal model. BPC results follow the "smaller is better" principle, meaning that lower values indicate better performance. d Energy efficiency and area efficiency comparison: our LSTM IC, conventional ADC model and recently published LSTM ICs from research papers [yue20197, kadetotad20208, shin201714, yin20171, conti2018chipmunk, natIBM64core, Nature2023analogAIchipIBM, jouppi2017datacenter]. Detailed calculations of energy efficiency and throughput at both macro and system levels are shown in \ref{supsec:Estimation_energy_area_latency_Macro}, \ref{supsec:Estimation_energy_area_latency_system} and Tab. \ref{Tab:tab3 different ADC bit Comparison NLP}. Area efficiency of all works is normalized to a 1 GHz clock and a 16 nm CMOS process. e Energy efficiency and throughput comparison: our LSTM IC, conventional ADC model and recently published LSTM ICs from research papers [yue20197, kadetotad20208, shin201714, yin20171, conti2018chipmunk, natIBM64core, Nature2023analogAIchipIBM, jouppi2017datacenter].
  • ...and 15 more figures
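
The relation stated in the Figure 2d caption, where each ramp step $\Delta V_k$ is proportional to a conductance $G_{adc,k}$, can be sketched as follows. This is an illustration under stated assumptions rather than the paper's programming procedure: the maximum conductance `G_MAX` is a hypothetical value, the thresholds are the ideal inverse-sigmoid levels for 5 bits, the starting voltage of the ramp is treated as a separate constant, and device noise is ignored. All function names are hypothetical.

```python
import numpy as np

G_MAX = 150e-6  # hypothetical maximum programmable conductance in siemens (assumption)

def ramp_thresholds(bits=5):
    """Ideal sigmoid ramp thresholds: logit of uniformly spaced output levels."""
    n = 2 ** bits
    y = np.arange(1, n) / n
    return np.log(y / (1.0 - y))

def step_conductances(thresholds, g_max=G_MAX):
    """Map ramp step sizes dV_k to conductances with G_k proportional to dV_k.

    Returns the conductances and the scale factor (siemens per unit step).
    """
    dv = np.diff(thresholds)       # step sizes; positive because the ramp is monotonic
    scale = g_max / dv.max()       # largest step uses the largest conductance
    return dv * scale, scale

def rebuild_ramp(g, scale, v_start):
    """Recover the ramp by accumulating the steps encoded in the conductances."""
    return v_start + np.concatenate(([0.0], np.cumsum(g / scale)))

if __name__ == "__main__":
    th = ramp_thresholds(bits=5)
    g, scale = step_conductances(th)
    print("conductances (uS):", np.round(g * 1e6, 1))
    print("ramp recovered:", np.allclose(rebuild_ramp(g, scale, th[0]), th))
```

For a sigmoid target the resulting conductance profile is symmetric, with the smallest steps near the middle of the ramp (where the sigmoid is steepest) and the largest steps at both ends.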