A Precision-Optimized Fixed-Point Near-Memory Digital Processing Unit for Analog In-Memory Computing

Elena Ferro, Athanasios Vasilopoulos, Corey Lammie, Manuel Le Gallo, Luca Benini, Irem Boybat, Abu Sebastian

TL;DR

This work tackles the digital post-processing bottleneck in Analog In-Memory Computing (AIMC) by introducing a fixed-point Near-Memory Processing Unit (NMPU) that performs affine correction and activation-related operations near memory. The authors develop a two-branch fixed-point datapath with carefully chosen bit-widths (scale: unsigned (1,7); offset: signed (7,1)) and a precision-audited rounding/truncation scheme, and they validate it through cycle-accurate simulations and data from an AIMC chip. Synthesis in 14 nm CMOS demonstrates substantial area savings (≈505×) and enables high parallelism (64 NMPUs per tile, each serving 4 columns), delivering a 139× speed-up over a prior FP16-based design with only minor accuracy losses on ResNet9/ResNet32 for CIFAR10/CIFAR100 (≈0.12% and ≈0.40% drops, respectively). The result is a compact, high-throughput near-memory digital block that preserves AIMC efficiency while enabling scalable DL inference with standard steps such as ReLU and Batch Normalization.
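To make the fixed-point formats concrete, here is a minimal Python sketch of an affine correction followed by ReLU. It is an illustration under stated assumptions, not the authors' datapath: it reads (i,f) as (integer bits, fractional bits) in an 8-bit word, assumes the signed offset is two's complement with the sign bit counted among its integer bits, assumes a signed integer ADC code as input, and uses round-half-up at a single final stage. The names `to_fixed` and `affine_relu` are hypothetical.

```python
import math


def to_fixed(value, frac_bits, total_bits, signed):
    """Quantize a real value to an integer code, saturating at the word limits."""
    code = math.floor(value * (1 << frac_bits) + 0.5)  # round half up
    if signed:
        lo, hi = -(1 << (total_bits - 1)), (1 << (total_bits - 1)) - 1
    else:
        lo, hi = 0, (1 << total_bits) - 1
    return max(lo, min(hi, code))


def affine_relu(adc_code, scale, offset):
    """Compute ReLU(round(scale * adc_code + offset)) using integer arithmetic only."""
    scale_q = to_fixed(scale, frac_bits=7, total_bits=8, signed=False)   # assumed (1,7) unsigned
    offset_q = to_fixed(offset, frac_bits=1, total_bits=8, signed=True)  # assumed (7,1) signed
    # The product carries 7 fractional bits; the offset carries 1,
    # so shift the offset left by 6 to align the binary points.
    acc = adc_code * scale_q + (offset_q << 6)
    # Round to the nearest integer: add half an LSB, then arithmetic shift right by 7.
    out = (acc + (1 << 6)) >> 7
    return max(0, out)  # ReLU


print(affine_relu(10, scale=1.5, offset=-3.5))
print(affine_relu(-10, scale=1.5, offset=-3.5))
```

Because both parameters fit in 8 bits, the correction reduces to one small integer multiply, one add, and a shift, which is the kind of operation that is far cheaper in area than an FP16 multiply-add. The paper's actual rounding/truncation stages may differ from the single rounding step assumed here.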

Abstract

Analog In-Memory Computing (AIMC) is an emerging technology for fast and energy-efficient Deep Learning (DL) inference. However, a certain amount of digital post-processing is required to deal with circuit mismatches and non-idealities associated with the memory devices. Efficient near-memory digital logic is critical to retain the high area/energy efficiency and low latency of AIMC. Existing systems adopt Floating Point 16 (FP16) arithmetic with limited parallelization capability and high latency. To overcome these limitations, we propose a Near-Memory digital Processing Unit (NMPU) based on fixed-point arithmetic. It achieves competitive accuracy and higher computing throughput than previous approaches while minimizing the area overhead. Moreover, the NMPU supports standard DL activation steps, such as ReLU and Batch Normalization. We perform a physical implementation of the NMPU design in a 14 nm CMOS technology and provide detailed performance, power, and area assessments. We validate the efficacy of the NMPU by using data from an AIMC chip and demonstrate that a simulated AIMC system with the proposed NMPU outperforms existing FP16-based implementations, providing 139$\times$ speed-up, 7.8$\times$ smaller area, and a competitive power consumption. Additionally, our approach achieves an inference accuracy of 86.65 %/65.06 %, with an accuracy drop of just 0.12 %/0.4 % compared to the FP16 baseline when benchmarked with ResNet9/ResNet32 networks trained on the CIFAR10/CIFAR100 datasets, respectively.

Paper Structure

This paper contains 6 sections, 4 equations, 6 figures, 1 table.

Figures (6)

  • Figure 1: (a) Sample tile consisting of a 256$\times$256 PCM crossbar, DAC block, 256 ADCs and 64 NMPUs; (b) flow diagram of the proposed fixed-point NMPU.
  • Figure 2: Set of 256 transfer curves after calibration (purple) and after the affine correction (orange).
  • Figure 3: Quantization error for the 15 explored configurations adopting different first and second cut/round stages.
  • Figure 4: Relative quantization error on ResNet9 layers.
  • Figure 5: Inference accuracy for ResNet9 and ResNet32 for different architectures. The error bars represent the standard deviation obtained from 10 repetitions.
  • ...and 1 more figure