MicroScopiQ: Accelerating Foundational Models through Outlier-Aware Microscaling Quantization

Akshat Ramachandran, Souvik Kundu, Tushar Krishna

TL;DR

MicroScopiQ addresses the difficulty of quantizing foundational models by combining Hessian-guided pruning with outlier-aware quantization: outliers are encoded at higher precision, while the surrounding inliers keep a consistent, hardware-friendly data format. It uses MX-FP for outliers and MX-INT for inliers, together with a network-on-chip (NoC) called ReCoN that redistributes and coordinates outlier partial sums, all implemented on a multi-precision processing-element (PE) array. The approach achieves state-of-the-art quantization accuracy across LLMs, VLMs, CNNs, and SSMs, with up to 3x faster inference and 2x lower energy than prior methods, and an average effective bit-width (EBW) as low as approximately 2.36 bits. These results point to a practical path toward high-accuracy, energy-efficient FM inference on specialized accelerators, and the design can also be integrated into GPUs.

Abstract

Quantization of foundational models (FMs) is significantly more challenging than for traditional DNNs due to the emergence of large-magnitude values called outliers. Existing outlier-aware algorithm-architecture co-design techniques either use mixed precision, retaining outliers at high precision but compromising hardware efficiency, or quantize inliers and outliers at the same precision, improving hardware efficiency at the cost of accuracy. To address this mutual exclusivity, we propose MicroScopiQ, a novel co-design technique that leverages pruning to complement outlier-aware quantization. MicroScopiQ retains outliers at higher precision while pruning a certain fraction of the least important weights to distribute the additional outlier bits, ensuring high accuracy, aligned memory, and hardware efficiency. We design a high-throughput, low-overhead accelerator architecture composed of multi-precision INT processing elements and a network-on-chip called ReCoN that efficiently abstracts the complexity of supporting high-precision outliers. Additionally, unlike prior techniques, MicroScopiQ does not assume any locality of outlier weights, enabling applicability to a broad range of FMs. Extensive experiments across diverse quantization settings demonstrate that MicroScopiQ achieves state-of-the-art quantization accuracy, while delivering up to 3x faster inference and 2x lower energy consumption compared to existing alternatives. Code is available at: https://github.com/georgia-tech-synergy-lab/MicroScopiQ-LLM-Quantization
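
The bit-redistribution idea in the abstract can be illustrated with a minimal sketch. It assumes that each outlier takes twice the inlier bit-width and that one inlier is pruned per outlier, so the group keeps a fixed bit budget. The function name `plan_group`, the median-based outlier rule, and the use of weight magnitude as an importance score are hypothetical stand-ins chosen for illustration (the paper describes Hessian-guided pruning); this is not the authors' implementation.

```python
import statistics


def plan_group(weights, inlier_bits=2, ratio=4.0):
    """Toy bit-budget plan for one weight group.

    Assumptions of this sketch (not taken from the paper's code):
      * a weight is an outlier if |w| > ratio * median(|w|) in the group;
      * each outlier is stored with 2 * inlier_bits bits;
      * one inlier is pruned per outlier, chosen by smallest |w|
        (a stand-in for a Hessian-based importance score).
    """
    magnitudes = [abs(w) for w in weights]
    threshold = ratio * statistics.median(magnitudes)

    outliers = [i for i, m in enumerate(magnitudes) if m > threshold]
    inliers = [i for i in range(len(weights)) if i not in outliers]
    if len(outliers) > len(inliers):
        raise ValueError("not enough inliers to absorb the outlier bits")

    # Prune the least important inliers, one per outlier.
    by_importance = sorted(inliers, key=lambda i: magnitudes[i])
    pruned = sorted(by_importance[:len(outliers)])
    kept = [i for i in inliers if i not in pruned]

    # Pruned slots cost 0 bits; their budget is handed to the outliers.
    total_bits = len(kept) * inlier_bits + len(outliers) * 2 * inlier_bits
    return outliers, pruned, kept, total_bits


group = [0.10, -0.20, 0.05, 1.50, -0.02, 0.15, -0.12, 0.08]
outliers, pruned, kept, total_bits = plan_group(group, inlier_bits=2)
print(outliers, pruned, kept, total_bits)
```

Under these assumptions the group occupies `len(group) * inlier_bits` bits regardless of how many outliers it contains, which is the sense in which memory stays aligned. Any metadata needed to locate outliers and pruned slots is not modeled here.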

Paper Structure

This paper contains 35 sections, 5 equations, 18 figures, 8 tables, 1 algorithm.

Figures (18)

  • Figure 1: Depiction of MX-FP data format with level-1 scale factor and level-2 microExponent ($\mu X$), with $k_1$ and $k_2$ the group sizes over which these two factors are shared.
  • Figure 2: (a) Layer-wise distribution of outliers and adjacent outliers as a percentage of total number of weights, (b) Quantization accuracy comparison between OliVe-W4A16 and MicroScopiQ-W2A16 on various benchmarks.
  • Figure 3: (a) Overview of the proposed MicroScopiQ quantization framework depicting methodology of inlier and outlier quantization and redistribution of outlier bits for a sample LLM weight matrix. Comparison against prior quantization frameworks (b) GOBO, and (c) OliVe.
  • Figure 4: Integration of MicroScopiQ into a WS systolic array.
  • Figure 5: MicroScopiQ memory organization.
  • ...and 13 more figures
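
The two-level scaling described in the Figure 1 caption can be sketched as follows. This is a rough illustration only: it assumes the level-1 factor is a power-of-two exponent shared by $k_1$ elements and that the level-2 microExponent is a small non-negative offset shared by $k_2$ elements and subtracted from the level-1 exponent. The function name `two_level_exponents`, the default group sizes, and the 1-bit microExponent width are assumptions of this sketch, not values taken from the paper.

```python
import math


def two_level_exponents(weights, k1=8, k2=2, mu_bits=1):
    """Compute a shared exponent per k1-group and an offset per k2-subgroup.

    Returns a list of (level1_exponent, [micro_exponent, ...]) tuples.
    A subgroup's effective exponent is level1_exponent - micro_exponent.
    """
    mu_max = 2 ** mu_bits - 1  # largest offset representable in mu_bits
    result = []
    for g in range(0, len(weights), k1):
        group = weights[g:g + k1]
        group_max = max(abs(w) for w in group)
        # Level 1: exponent of the largest magnitude in the group.
        e1 = math.floor(math.log2(group_max)) if group_max > 0 else 0

        micro = []
        for s in range(0, len(group), k2):
            sub_max = max(abs(w) for w in group[s:s + k2])
            if sub_max > 0:
                e_sub = math.floor(math.log2(sub_max))
            else:
                e_sub = e1 - mu_max  # all-zero subgroup: use the largest offset
            # Level 2: how far below the group exponent this subgroup sits,
            # clamped to what mu_bits can encode.
            micro.append(min(e1 - e_sub, mu_max))
        result.append((e1, micro))
    return result


weights = [1.5, 0.9, 0.3, 0.2, 0.05, 0.04, 1.0, 0.6]
print(two_level_exponents(weights))
```

Subgroups whose values are small relative to the group maximum receive a nonzero offset, which gives them a finer effective scale than a single shared exponent would.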