Multiplier-free In-Memory Vector-Matrix Multiplication Using Distributed Arithmetic

Felix Zeller, John Reuben, Dietmar Fey

TL;DR

This paper tackles the data-movement energy problem in neural network VMM by introducing multiplier-free in-memory VMM using Distributed Arithmetic (DA). By precomputing and storing weight-sums and evaluating the input vector bit-serially with peripheral adders around ReRAM, the approach eliminates the need for power-hungry ADCs/DACs. Functionally validated by transistor-level simulations, the method achieves 88 ns latency and 110 pJ per VMM, with a one-time weight-sum write cost amortized over many inferences; it outperforms bit-slicing by about 4.5× in latency and 12× in energy. The solution scales through array partitioning and relies on simple sense-amplifier readout, offering a practical path to energy-efficient in-memory CNN inference hardware. The product $Y = X^T W = \sum_i X_i W_i$ is computed without multipliers by exploiting fixed weights in memory and bit-serial input processing.
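
To make the bit-serial idea concrete, the standard DA decomposition can be written out as below. This is a sketch under stated assumptions, not the paper's own equation: it assumes unsigned $B$-bit inputs, and the symbols $N$ (number of inputs), $B$, $x_{i,b}$ (bit $b$ of $X_i$) and $W_{ij}$ (entry of $W$ in row $i$, column $j$) are introduced here for illustration.

$$
Y_j = \sum_{i=1}^{N} X_i W_{ij}, \qquad X_i = \sum_{b=0}^{B-1} 2^{b}\, x_{i,b}
\quad\Longrightarrow\quad
Y_j = \sum_{b=0}^{B-1} 2^{b} \left( \sum_{i=1}^{N} x_{i,b}\, W_{ij} \right)
$$

Since each $x_{i,b}$ is 0 or 1, the inner sum is one of $2^{N}$ possible sums of weights. These can be computed once and stored, so each input bit position costs one memory read, one shift and one addition.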

Abstract

Vector-Matrix Multiplication (VMM) is a fundamental and frequently required computation in Neural Network (NN) inference. Because inference requires a large amount of data movement, VMM can benefit greatly from in-memory computing. However, the ADCs/DACs required for in-memory VMM consume significant power and area. Distributed Arithmetic (DA), a technique in computer architecture prevalent in the 1980s, computes the inner product (dot product) of two vectors without a hard-wired multiplier when one of the vectors is constant. In this work, we extend the DA technique to multiply an input vector with a constant matrix. By storing the sums of the weights in memory, DA achieves VMM using shift-and-add circuits in the periphery of ReRAM memory. We verify functional correctness and estimate non-functional properties (latency, energy, area) by performing transistor-level simulations. Using energy-efficient sensing and fine-grained pipelining, our approach achieves 4.5× lower latency and 12× lower energy than VMM performed conventionally in memory by bit slicing. Furthermore, DA completely eliminates the need for power-hungry ADCs, which are the main source of area and energy consumption in current in-memory VMM implementations.
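
The following is a minimal software sketch of the DA idea described in the abstract, not the authors' circuit. It assumes unsigned integer inputs, and the function names `build_weight_sum_table` and `da_vmm` are hypothetical. The table stands in for the weight-sums stored in memory, and the loop over bit positions stands in for the peripheral shift-and-add logic.

```python
def build_weight_sum_table(W):
    """Precompute every possible sum of rows of W (done once, weights are fixed).

    table[addr][j] is the sum of W[i][j] over all rows i whose bit is set
    in addr, so the table has 2**len(W) entries.
    """
    n_rows, n_cols = len(W), len(W[0])
    table = []
    for addr in range(1 << n_rows):
        sums = [0] * n_cols
        for i in range(n_rows):
            if (addr >> i) & 1:
                for j in range(n_cols):
                    sums[j] += W[i][j]
        table.append(sums)
    return table


def da_vmm(x, table, n_bits):
    """Compute Y = X^T W with table lookups, shifts and adds only.

    x      : list of unsigned integers, each below 2**n_bits (assumption)
    table  : output of build_weight_sum_table
    n_bits : input bit width, i.e. the number of bit-serial steps
    """
    acc = [0] * len(table[0])
    for b in range(n_bits):
        # Gather bit b of every input into one address (one bit per row of W).
        addr = 0
        for i, xi in enumerate(x):
            addr |= ((xi >> b) & 1) << i
        # Read the stored weight-sum, weight it by 2**b, and accumulate.
        for j, s in enumerate(table[addr]):
            acc[j] += s << b
    return acc


W = [[1, 2], [3, 4], [5, 6]]          # 3 inputs, 2 outputs
table = build_weight_sum_table(W)     # 8 stored weight-sums
print(da_vmm([3, 0, 5], table, 3))    # [28, 36]
```

No multiplication appears in `da_vmm`: the number of steps equals the input bit width, while the table grows as 2 to the power of the number of rows. That growth is why partitioning a large weight matrix into smaller slices matters for scaling.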

Paper Structure

This paper contains 12 sections, 1 equation, 10 figures, 1 table.

Figures (10)

  • Figure 1: Simplified illustration of Distributed Arithmetic
  • Figure 2: The weights of the fully connected NN form the matrix $W$ and the inputs form the vector $X$. The sum of the weights is written into the processing memory and $X$ is applied in a bit-serial manner.
  • Figure 3: Illustration of the mapping of the first convolutional layer of LeNet-5 to vectors and matrices. $If_{1}$ and $Of_{1,i}$ represent the input feature maps and output feature maps of convolutional layer 1. Each stride of the convolution becomes a VMM.
  • Figure 4: A VMM with an 8$\times$8 matrix can be implemented in memory in 8 cycles
  • Figure 5: Scaling of our approach as the weight matrix scales from 8$\times$8 to 16$\times$16. The 16$\times$16 weight matrix is sliced into two 8$\times$16 matrices and their sums are written to two processing arrays.
  • ...and 5 more figures