A 64-core mixed-signal in-memory compute chip based on phase-change memory for deep neural network inference

Manuel Le Gallo, Riduan Khaddam-Aljameh, Milos Stanisavljevic, Athanasios Vasilopoulos, Benedikt Kersting, Martino Dazzi, Geethan Karunaratne, Matthias Braendli, Abhairaj Singh, Silvia M. Mueller, Julian Buechel, Xavier Timoneda, Vinay Joshi, Urs Egger, Angelo Garofalo, Anastasios Petropoulos, Theodore Antonakopoulos, Kevin Brew, Samuel Choi, Injo Ok, Timothy Philip, Victor Chan, Claire Silvestre, Ishtiaq Ahsan, Nicole Saulnier, Vijay Narayanan, Pier Andrea Francese, Evangelos Eleftheriou, Abu Sebastian

TL;DR

A multicore AIMC chip designed and fabricated in 14 nm complementary metal–oxide–semiconductor technology with backend-integrated phase-change memory is reported, which demonstrates near-software-equivalent inference accuracy with ResNet and long short-term memory networks, while implementing all the computations associated with the weight layers and the activation functions on the chip.

Abstract

The need to repeatedly shuttle around synaptic weight values from memory to processing units has been a key source of energy inefficiency associated with hardware implementation of artificial neural networks. Analog in-memory computing (AIMC) with spatially instantiated synaptic weights holds high promise to overcome this challenge, by performing matrix-vector multiplications (MVMs) directly within the network weights stored on a chip to execute an inference workload. However, to achieve end-to-end improvements in latency and energy consumption, AIMC must be combined with on-chip digital operations and communication to move towards configurations in which a full inference workload is realized entirely on-chip. Moreover, it is highly desirable to achieve high MVM and inference accuracy without application-wise re-tuning of the chip. Here, we present a multi-core AIMC chip designed and fabricated in 14-nm complementary metal-oxide-semiconductor (CMOS) technology with backend-integrated phase-change memory (PCM). The fully-integrated chip features 64 256x256 AIMC cores interconnected via an on-chip communication network. It also implements the digital activation functions and processing involved in ResNet convolutional neural networks and long short-term memory (LSTM) networks. We demonstrate near software-equivalent inference accuracy with ResNet and LSTM networks while implementing all the computations associated with the weight layers and the activation functions on-chip. The chip can achieve a maximal throughput of 63.1 TOPS at an energy efficiency of 9.76 TOPS/W for 8-bit input/output matrix-vector multiplications.
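
The headline figures in the abstract imply two quantities that can be checked with simple arithmetic: the total number of crossbar unit cells, and the power draw at peak throughput. The sketch below is illustrative only. It assumes that the 63.1 TOPS and 9.76 TOPS/W figures refer to the same operating point and that each unit cell stores one signed weight; neither derived number is stated in the abstract, and the function names are hypothetical.

```python
# Figures quoted from the abstract.
NUM_CORES = 64
ROWS_PER_CORE = 256
COLS_PER_CORE = 256
PEAK_THROUGHPUT_TOPS = 63.1
ENERGY_EFFICIENCY_TOPS_PER_W = 9.76


def weight_capacity(num_cores, rows, cols):
    """Total crossbar unit cells, assuming one signed weight per unit cell."""
    return num_cores * rows * cols


def implied_power_watts(throughput_tops, efficiency_tops_per_w):
    """Power implied by throughput / efficiency (assumes the same operating point)."""
    return throughput_tops / efficiency_tops_per_w


total_weights = weight_capacity(NUM_CORES, ROWS_PER_CORE, COLS_PER_CORE)
power_w = implied_power_watts(PEAK_THROUGHPUT_TOPS, ENERGY_EFFICIENCY_TOPS_PER_W)

print(f"unit cells across the chip: {total_weights:,}")
print(f"implied power at peak throughput: {power_w:.2f} W")
```

Under these assumptions the chip holds roughly 4.2 million weights and draws about 6.5 W at peak throughput.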

Paper Structure (7 sections, 2 equations, 10 figures, 3 tables)

This paper contains 7 sections, 2 equations, 10 figures, 3 tables.

Figures (10)

  • Figure 1: IBM HERMES Project Chip overview. a, Electronic design automation snapshot and inset showing a micrograph of the chip. Therein, the outline of the 64 cores can be recognized as well as the array of 5,616 pads. b, Schematic overview of the different components on the multi-core chip. c, Schematic overview of a single PCM-based in-memory compute core. (1) PCM crossbar array, (2) current DAC-based programming unit, (3) PWM-based input modulator, (4) left and right ADC arrays, (5) local digital processing unit (LDPU), (6) left and right ADC register arrays, (7) left and right ADC convert and scale blocks, (8) activation function block, (9) link controller. d, Block diagram of a global digital processing unit (GDPU) used for LSTM-related data processing. Inputs to and outputs from the GDPU slice are in an 8-bit signed integer format (INT8). By using custom conversion blocks marked by i2f and f2i, INT8 values can be converted into FP16 and vice versa. Additionally, conversions at input/output can encompass a per-gate/per-output scale and bias operation using the FMA units. The inputs from the I, A, F, and O BLs are time multiplexed, and a single block is used to compute the gates' activation vectors. The sigmoid activation function for the I, F and O gates is computed by scaling and offsetting the output of the hyperbolic tangent function with the third (from the top) FMA unit, according to the identity $\mathrm{sigmoid}(x)=1/2 + 1/2 \cdot \tanh(x/2)$.
  • Figure 1: Digital communication fabric. a, Schematic of link controller. The dotted bounding box refers to the core boundary. b, Possible link connections for Core(3,5) and Core(4,5), where the notation Core($r$,$c$) refers to the core located at row $r$ and column $c$ in Fig. 1b. c, Link connections for the entire chip (available connections are denoted in green color). The RX and TX connections for Core(3,5) and Core(4,5) shown in b are indicated.
  • Figure 2: MVM characterization. a, Unit-cell SET and RESET distributions of the 64 cores. Shades of blue and green denote different cores. The dashed line represents the one-device programming (ODP) $G_{\mathrm{max}}$, defined as the tenth percentile of the core with the least conductive SET states. The inset shows the unit-cell yield of the 64 cores. The yield condition is that the unit-cell can be programmed to $|G_{\mathrm{RESET}}|<5$ and $G_{\mathrm{SET}}>50$. b, Error distributions of $\epsilon_{\mathrm{total}}$, $\epsilon_{\mathrm{linear}}$ and $\epsilon_{\mathrm{residual}}$ for the 64 cores in LDPU (int8) units. A uniformly distributed weight matrix with 30% sparsity is programmed on each core, and 2,048 input vectors uniformly distributed with 30% sparsity are then sent to each core to perform the MVMs. The reduction of $\epsilon_{\mathrm{total}}$ achieved with two-device programming (TDP) can be attributed to a reduction of $\epsilon_{\mathrm{linear}}$. The distributions are computed over all error vector elements of 2,048 MVMs performed with all 64 cores. The inset shows the measured MVM results of one core for ODP and TDP against the ideal MVMs computed in software. c, Weight error $\mathrm{std}(W-\hat{W})/W_{\mathrm{max}}$, where $\mathrm{std}(W)$ is the standard deviation computed over all elements of $W$, as a function of target weight for ODP and TDP. The error bars represent one standard deviation over the 64 cores. d, 2-norm of $\epsilon_{\mathrm{total}}$, $\epsilon_{\mathrm{linear}}$ and $\epsilon_{\mathrm{residual}}$, normalized by $\| y_{\mathrm{fp}} \|_2$, as a function of time for ODP and TDP. Due to temporal conductance drift of PCM devices, $\epsilon_{\mathrm{total}}$ and $\epsilon_{\mathrm{linear}}$ increase gradually. The dashed lines represent the error achieved by a digital engine with 8-bit input/output precision and $N$-bit weight precision. The error bars represent one standard deviation over the 64 cores.
  • Figure 2: PCM crossbar array. a, Schematic of 8T4R unit-cell. The top electrodes of the conductance pairs of each polarity connect to separate bit lines $BL_{m}^{+}$, $BL_{m}^{-}$ and the sources of their lower access-transistors connect to separate source lines $SL_{n}^{+}$, $SL_{n}^{-}$. Thus, the devices in a conductance pair are weighted with equal significance and the total conductance per unit-cell becomes: $\left(g_{1}^{+}+g_{2}^{+}\right)-\left(g_{1}^{-}+g_{2}^{-}\right)$. b, Schematic of PCM crossbar array. To program the PCM devices, the dedicated per-core programming FSM instructs the diagonal selection decoder to enable one diagonal of cells that contains the devices that are to be programmed. The diagonal selection decoder controls the ${SEL}_{m,n}^{1}$ and ${SEL}_{m,n}^{2}$ signals in the unit-cell, which are routed diagonally throughout the array. The selected devices are programmed by the current-steering DAC-based programming units located on top of the PCM array. To perform an MVM, the 256 inputs to the crossbar array ($IN_0-IN_{255}$) are applied via the red source lines (SLs) to the 8T4R cells. The resulting bit line (BL) currents are summed up on the blue wires and read by the ADCs that flank the crossbar array on the left and right. c, Layout of one ADC. The block diagram that is shown below the layout illustrates the various components of the ADC, namely, the read voltage regulator, the current-to-frequency converter, and the $2\times{}$12-bit ripple counter.
  • Figure 3: ResNet-9 on CIFAR-10 measurement results. a, Network architecture. ResNet-9 comprises 8 convolution layers and 1 dense classification layer. Each convolution layer is followed by batch normalization and ReLU activation. Two residual connections are present, one from Conv1 to Conv3 and one from Conv5 to Conv7. Four max-pooling layers, implemented off-chip, are used to reduce the size of the input volume along network depth. b, Mapping of ResNet-9 onto the chip. Weights of all layers are programmed onto 40 cores. Conv0-4 and the dense layer are implemented on the first two rows of cores, whereas Conv5-7 each span an entire row. On-chip links connect the LDPUs of the cores within a layer to realize on-chip data aggregation. Batch normalization, ReLU, residual and bias additions are implemented in the LDPUs. c, Measured test accuracy results on CIFAR-10 benchmark for ODP and TDP compared with software baseline and simulation results that include the hardware-measured weight noise and quantization from the PWM and LDPU. The error bars represent one standard deviation over 10 inference runs.
  • ...and 5 more figures
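
Two computations described in the captions above lend themselves to a short sketch: the GDPU's sigmoid, obtained by scaling and offsetting a hyperbolic tangent, and the differential conductance of the 8T4R unit cell, which acts as a signed weight in the crossbar MVM. The code below is a minimal, idealized illustration and not the chip's implementation. It works in double-precision floats rather than FP16, and it ignores PWM input modulation, ADC quantization, programming noise and conductance drift. All function names are hypothetical.

```python
import math


def sigmoid_via_tanh(x):
    """Sigmoid from tanh: sigmoid(x) = 1/2 + 1/2 * tanh(x/2)."""
    return 0.5 + 0.5 * math.tanh(x / 2.0)


def unit_cell_conductance(g1_pos, g2_pos, g1_neg, g2_neg):
    """Net conductance of one 8T4R unit cell: (g1+ + g2+) - (g1- + g2-)."""
    return (g1_pos + g2_pos) - (g1_neg + g2_neg)


def crossbar_mvm(cells, inputs):
    """Ideal MVM over a crossbar.

    cells[n][m] is a 4-tuple (g1+, g2+, g1-, g2-) for the unit cell at
    source line n and bit line m. Each output is the sum over source
    lines of input[n] times the net conductance of cell (n, m).
    """
    num_bit_lines = len(cells[0])
    outputs = [0.0] * num_bit_lines
    for n, x in enumerate(inputs):
        for m in range(num_bit_lines):
            outputs[m] += x * unit_cell_conductance(*cells[n][m])
    return outputs


# Toy 2x2 array with made-up conductance values (arbitrary units).
toy_cells = [
    [(1.0, 2.0, 0.5, 0.5), (0.0, 0.0, 1.0, 1.0)],
    [(3.0, 0.0, 0.0, 0.0), (1.0, 1.0, 1.0, 1.0)],
]
toy_outputs = crossbar_mvm(toy_cells, [1.0, 2.0])
print(toy_outputs)
print(sigmoid_via_tanh(0.0))
```

Reusing a single tanh block for both the tanh and sigmoid gates matches the caption's description of one shared block computing the gates' activation vectors; the sigmoid costs only one extra multiply-add on top of tanh.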