Neural Precision Polarization: Simplifying Neural Network Inference with Dual-Level Precision

Dinithi Jayasuriya, Nastaran Darabi, Maeesha Binte Hashem, Amit Ranjan Trivedi

Abstract

We introduce a precision polarization scheme for DNN inference that uses only two precision levels, one very low and one very high: most network weights and activations are assigned low precision, while high-precision paths are reserved for targeted error compensation. This separation allows each precision level to be optimized independently, reducing memory and computation demands without compromising model accuracy. In the discussed approach, a floating-point model is trained in the cloud and then downloaded to an edge device, where its weights and activations are directly quantized to the edge device's desired precision level, such as NF4 or INT8. To address the accuracy loss from quantization, surrogate paths based on low-rank approximations are introduced layer by layer. These paths are trained with a sensitivity-based metric on minimal training data to recover accuracy lost to quantization as well as to process variability, for example when the main prediction path is implemented with analog acceleration. Our simulation results show that neural precision polarization achieves a MAC efficiency of approximately 464 TOPS per Watt while maintaining reliability, by integrating rank-8 error recovery paths with highly efficient, though potentially unreliable, bit-plane-wise compute-in-memory processing.
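The dual-path idea described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' method: the uniform symmetric 4-bit quantizer stands in for NF4/INT8, and the low-rank path is obtained from a truncated SVD of the weight error instead of being trained with a sensitivity-based metric on data. The function names `quantize_symmetric` and `low_rank_surrogate`, the layer shape, and the random weights are all hypothetical.

```python
import numpy as np

def quantize_symmetric(w, bits=4):
    """Uniform symmetric quantization (assumed stand-in for NF4/INT8)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.clip(np.round(w / scale), -qmax, qmax) * scale

def low_rank_surrogate(err, rank=8):
    """Best rank-`rank` factorization err ~= B @ A via truncated SVD.

    Assumption: the source trains these paths on data; SVD is used here
    only to keep the sketch short and deterministic.
    """
    u, s, vt = np.linalg.svd(err, full_matrices=False)
    B = u[:, :rank] * s[:rank]   # shape (out_features, rank)
    A = vt[:rank, :]             # shape (rank, in_features)
    return A, B

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 128))   # hypothetical full-precision layer
x = rng.standard_normal(128)         # hypothetical input activation

Wq = quantize_symmetric(W, bits=4)           # low-precision main path
A, B = low_rank_surrogate(W - Wq, rank=8)    # high-precision surrogate path

# Inference: low-precision main path plus high-precision correction.
y_npp = Wq @ x + B @ (A @ x)

err_low = np.linalg.norm(W - Wq)             # weight error, main path only
err_npp = np.linalg.norm(W - (Wq + B @ A))   # weight error with surrogate
print(f"main path only: {err_low:.4f}, with rank-8 path: {err_npp:.4f}")
```

With random weights the improvement from the SVD-based path can be modest, because rounding error has little low-rank structure. The surrogate adds only `rank * (in_features + out_features)` high-precision parameters per layer, which is what keeps the overhead small.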

Paper Structure

This paper contains 8 sections, 6 figures.

Figures (6)

  • Figure 1: Overview of Neural Precision Polarization (NPP): (a) NPP employs only two quantization levels, with most weights at ultra-low precision (e.g., FP4 or NF4) and selective high-precision surrogate paths to mitigate accuracy loss. This dual-level approach enables dedicated optimization and simplified implementation of network inference. (b) Under NPP, a cloud-trained floating-point model is downloaded to the edge, with weights and activations quantized to meet the edge device's precision requirements. Layer-wise surrogate paths use low-rank approximations and sensitivity-based metrics to recover accuracy with minimal overhead and retraining data. Compute-in-memory processing of low-rank paths is also discussed.
  • Figure 2: Quantization Error Compensation with NPP: LoRA fine-tuning recovers the accuracy lost to quantization in ViT models on the (a) CIFAR100 and (b) ImageNet datasets, with the degree of recovery varying by rank.
  • Figure 3: (a) Accuracy degradation under process-variability-induced weight perturbations and its compensation with low-rank tuning. Error compensation at varying (b) rank and (c) depth. (d) Number of surrogate parameters at various ranks and depths.
  • Figure 4: Impact of Sensitivity Ranking and Sample Selection on Accuracy: (a) Sensitivity ranking is used to select the most sensitive samples for each class. Accuracy increases slightly with the number of samples. (b) Accuracy variation with sample count across different ranks in ResNet50.
  • Figure 5: The NPP approach enhances efficiency by utilizing highly energy-efficient, though potentially unreliable, building blocks for reliable processing. (a) shows an in-memory computing unit with bit-plane-wise separation, removing the need for ADCs/DACs and achieving high energy efficiency through aggressive quantization. To compensate for processing errors, FP32 precision surrogate paths are integrated using digital building blocks, as shown in (b).
  • ...and 1 more figures
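
The bit-plane-wise separation mentioned in the Figure 5 caption can be sketched in software under stated assumptions. The example below shows one common reading of the idea: a dot product of unsigned integers is split into 1-bit planes, each plane pair is reduced with an AND followed by a count, and the partial sums are recombined with powers of two. The function names, the 4-bit unsigned operands, and the recombination order are hypothetical and are not taken from the paper's circuit.

```python
import numpy as np

def bit_planes(v, bits):
    """Split an unsigned integer vector into `bits` binary planes (LSB first)."""
    return [(v >> i) & 1 for i in range(bits)]

def bitplane_dot(a, w, bits=4):
    """Dot product computed one bit-plane pair at a time.

    Each plane pair needs only a bitwise AND and a count of ones, so no
    multi-bit multiply is required; shifts recombine the partial sums.
    """
    total = 0
    for i, a_plane in enumerate(bit_planes(a, bits)):
        for j, w_plane in enumerate(bit_planes(w, bits)):
            total += int(np.sum(a_plane & w_plane)) << (i + j)
    return total

rng = np.random.default_rng(0)
a = rng.integers(0, 16, size=32, dtype=np.int64)   # hypothetical 4-bit activations
w = rng.integers(0, 16, size=32, dtype=np.int64)   # hypothetical 4-bit weights

result = bitplane_dot(a, w, bits=4)
reference = int(a @ w)
print(result == reference)
```

Because every partial product is a 1-bit by 1-bit operation, this decomposition is consistent with the caption's point that bit-plane-wise processing avoids ADCs and DACs. Signed formats would need an extra sign-handling step that this sketch omits.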